ABSTRACT


This thesis presents a simulation and analysis of the Reduced Instruction Set 
Computer (RISC) architecture, and the effects on RISC performance of a lockup-free cache 
interface. RISC architectures achieve high performance by having a small, but sufficient, 
instruction set with most instructions executing in one clock cycle. Current RISC
performance ranges from 1.5 to 2.0 cycles per instruction (CPI). The goal of RISC is to
attain a CPI of 1.0. The major hindrance in attaining that goal is attributed to
instructions that require main
memory access. In this thesis, we attempt to reduce the effects of the high penalties for 
non-cache accesses by using a non-blocking cache memory subsystem called a lockup-free 
cache. This interface between the cache and main memory prevents the processor from 
"locking up" when a request to main memory occurs. This is accomplished by entering
all non-cache requests into a memory request queue, while the processor continues to issue 
and execute other instructions. The evaluation of the effects of the lockup-free cache 
interface is done using different variations of the interface design. The results show that
using the lockup-free cache improves RISC performance.


I. INTRODUCTION


A. COMPUTER TRENDS: The RISC Alternative 

The Reduced Instruction Set Computer (RISC) architecture
is fast emerging as the processor architecture of choice in
the computer industry. Since its arrival only a decade ago,
numerous implementations and variations of RISC have emerged, 
and the trend is continuing to shift toward the RISC concept. 
Industry experts predict that the RISC architecture could 
capture a major share of the market in the 1990’s [Bur90]. 

Until recently, the ever increasing demands for faster and 
more powerful computing machinery have been met by the Complex 
Instruction Set Computer (CISC) architectures. As the name 
implies, these computers consist of powerful, complex 
instructions interpreted by microcode residing on a chip which 
controls the hardware that executes the program [Met90]. The 
sizes of the machine language instruction sets of CISC 
architectures become larger as they increase in complexity. 
The underlying assumption of CISC is that machines that
feature many complicated instructions can provide more
computing power for their users. Despite the advantages offered
by the more complex instructions, the ideal performance of
CISC is not achieved because of the overhead resulting from
the complexity of the control circuits.


RISC takes a radically different approach to improved 
performance. RISC architecture emphasizes simplicity and 
efficiency by having a small instruction set [Dei90]. Most
RISC architectures are designed so that all instructions
execute in one cycle. This eliminates the more complex
instructions that require more than one cycle to execute
[SC91]. RISC architectures also avoid complicated
instructions requiring microcode support. Instead, these 
complex capabilities are implemented in software [TT91]. 

Major characteristics of RISC architectures include a
small number of instructions, simple load and store operations
for register-to-memory transfers, a large register set, deep
pipelines, and many levels of memory hierarchy [GM87]. The
most significant advantages of RISC include speed and ease of
implementation.


B. RISC PERFORMANCE THROUGH MEMORY HIERARCHY 

The performance of most computer architectures is often
limited by the design of their memory hierarchies. Typically,
memory is managed using a three-level memory hierarchy. The
first level is high-speed cache, which is expensive and of
lowest capacity. The second level is real or main memory,
which is slower and less expensive than cache memory. The
third level is the large-capacity storage devices such as
disks. This level holds programs and data that cannot fit in
levels one and two [FM87].


Main memory access delays are a major factor in the
performance of program execution. With a typical miss
penalty costing between 8 and 32 clock cycles [HP90], the
ability to control and minimize access to main memory will
have a direct effect on performance. This is particularly
critical to the RISC goal of executing one instruction per
cycle.
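
The effect of the miss penalty on the one-instruction-per-cycle goal can be illustrated with a minimal sketch. It assumes the common approximation that memory stalls add (memory references per instruction) x (miss rate) x (miss penalty) cycles to a base CPI. The function name, the 1.3 references per instruction, and the 5% miss rate are illustrative assumptions, not measurements from this thesis; only the 8 to 32 cycle penalty range comes from the text.

```python
def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    """Base CPI plus the average memory stall cycles per instruction."""
    return base_cpi + refs_per_instr * miss_rate * miss_penalty


# Assumed workload: 1.3 memory references per instruction, 5% miss rate.
for penalty in (8, 16, 32):
    cpi = effective_cpi(1.0, 1.3, 0.05, penalty)
    print(f"miss penalty {penalty:2d} cycles -> effective CPI {cpi:.2f}")
```

Under these assumed rates, even the smallest penalty in the quoted range moves the effective CPI well away from 1.0, which is why the handling of main memory accesses matters so much to RISC.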

RISC memory systems are usually complex because of the
requirement to keep instructions and data supplied to the
processor. The RISC memory hierarchy often includes an
on-chip instruction buffer to hold the next few instructions.
Some memory systems have both an instruction cache and a data
cache, which may be on- or off-chip. The main memory for RISC
systems is off-chip and sometimes off the processor board
[GM87]. This increases the penalty for cache misses and other
main memory accesses, making the requirement for highly
efficient memory management systems critical.

Improvements in RISC performance can likely be made
through improvements in its memory management systems, since
memory accesses consume a considerable number of machine
cycles. Regardless of the efficiency or hit rate of a cache
memory system, misses will occur and main memory must be
accessed. Main memory access is also required for write/store
instructions. Main memory access stalls or blocks the
processor for a specified number of cycles while data is
fetched and/or written.


A possible solution for reducing the cost of main memory
accesses is the concept of a lockup-free cache interface
[Kro81][SD91]. The lockup-free cache is a non-blocking cache
interface that queues main memory access requests (i.e., loads
and stores), allowing processing to continue while the memory
access queue is being served.


C. OBJECTIVES 
1. Primary Objective 

The primary objective of this thesis is to analyze the 
performance of different variations of the RISC architectural 
concept. Specifically, we examine the RISC architecture and 
the effects on performance of a memory subsystem known as a 
lockup-free cache interface. Experiments are made on models 
of several design possibilities of the lockup-free cache 
interface. 

In accomplishing the primary objective, an
intermediate objective is to acquire or develop effective
simulation tools to observe the behavior of a RISC
implementation as it executes different types of programs.
We choose the SPARC as a model of a RISC architecture because
SPARC incorporates many characteristics that are typical of
RISC architectures, and a trace simulator for it was
available.


2. Simulation Tools Objectives 
One objective of the simulation tools is to produce
executable SPARC binaries for input to a simulator which
produces binary address trace files. These address traces are
then used for producing instruction count data as shown in
Figure 1.1 and for translating the binary address trace into
a more readable SPARC assembler language format as shown in
Figure 1.2.


Address Trace --> Simulation Tool --> Instruction Count Report


Figure 1.1 Instruction Count Report Generation


Address Trace --> Simulation Tool


Figure 1.2 Assembly Language Translation


Another objective is to produce specially modified
address trace files to use in other simulation tools to
observe the RISC architecture under various workloads. They
also support "what-if" analysis as varied
architecture configurations are simulated.


A final objective of the simulation tools is to provide
functions for simulating the performance of a lockup-free
cache interface for a RISC processor (Figure 1.3). The
functions include simulating a non-blocking cache interface
and fetching instructions out of order for execution. These
specific techniques will be used to evaluate the effects of a
cache interface that minimizes main memory traffic.


Modified Address Trace --> Performance Cache Simulator


Figure 1.3 Performance Analysis of Alternative Designs


D. ORGANIZATION OF STUDY 

The remainder of this thesis is divided into five
chapters. In Chapter II, background information on RISC,
SPARC, and the lockup-free cache is provided. Chapter III
presents a model of a lockup-free cache. The simulation tools
used to model the lockup-free cache interface and to observe
the behavior of the SPARC architecture are discussed in
Chapter IV. In Chapter V, we simulate and evaluate
alternative design possibilities for the lockup-free cache
interface on RISC to improve the system performance. Chapter
VI presents our conclusions and further research issues.


II. BACKGROUND 


This chapter discusses the origin and characteristics of 
the RISC architecture and how RISC achieves high levels of 
performance. We then focus on the SPARC architecture and how 
it approaches the RISC concept. Finally, the lockup-free
cache interface design is introduced as it is modelled in this
study to determine its effect on RISC performance.


A. OVERVIEW OF THE REDUCED INSTRUCTION SET COMPUTER 
1. General 

The Reduced Instruction Set Computer (RISC) was
developed as a result of studies in the mid-1970s which
suggested that computer architectures consisting of many
complex instructions still executed mostly simple
instructions. Specifically, an IBM study observed that only
10 simple instructions accounted for over two-thirds of the
instruction executions on their System 370 architecture
[Dei90]. In 1979 the first RISC machine, the IBM 801, was
completed. The IBM 801 was also the first computer to feature
single-cycle instruction execution [SPA88].

The RISC architecture is based on the concept that
computers with a relatively small number of simple
instructions and a large number of registers can operate
faster than computers with a large instruction set containing
many complex instructions. Figure 2.1 shows instruction set
sizes of several RISC processors [Gro90][GM87]. Although the
name Reduced Instruction Set Computer implies reduced
instruction sets, there is much more to a RISC architecture
than that. The size of the instruction set is merely an end
result of the techniques used to improve computer performance.
Generally, RISC architectures are designed to exploit the
advantages of the latest features of both hardware and
software technologies.


RISC PROCESSORS INSTRUCTIONS


Figure 2.1 RISC Processors Instruction Set Sizes.


2. Characteristics of RISC Architecture 

There are several specific characteristics that are
typical of RISC architectures and that have proven to be the key
to enhanced performance. One important characteristic is that
all instructions except loads, stores, and floating point 
instructions can be executed in a single cycle. The single- 
cycle instruction set design makes it easier for several 
instructions to be processed at the same time, thus allowing 
more efficient pipeline operations. 

Another characteristic of RISC is its register 
intensive design. RISC machines have 32 or more general 
purpose registers, a feature that greatly reduces the number 
of operand memory references, thus reducing the costs of 
memory accesses [BEH91]. Generally, all RISC instructions 
use either two registers or a register and a constant with the 
result being placed in a destination register. The large 
number of registers can also be used to reduce the high cost 
of branch instructions by dedicating registers exclusively for 
branches [DW90]. 

RISC is also characterized by its simple fixed-format
instructions. All instructions are 32 bits long and the
operation codes and addresses are located in the same
positions of an instruction. To ensure simplicity of the
instruction set, RISC uses software built from simple
instructions to execute complex functions. Only those
functions that do not degrade performance are implemented in
hardware. The simple, fixed-format instruction set is also
good for real-time environments because of its speed and ease
of execution.

Another characteristic is that RISC designs have a
load/store architecture where all operations are performed on
operands stored in registers, with memory being accessed only
by load and store instructions. The load/store architecture
also makes it easier for compilers to optimize register
allocation [Kan87]. Figure 2.2 summarizes the basic
characteristics of RISC and how it differs from CISC
[Met90].


                     RISC               CISC

Instruction Set      Small (< 100)      Large (> 200)

Instruction Format   All instructions   Variable-size
                     32 bits long       instructions

Memory Addressing    Only load/store    Nearly all
                     instructions       instructions


Figure 2.2 How RISC Differs from CISC Architectures.





3. RISC Pipelining 
With the goal of achieving an execution rate of one
machine cycle per instruction, one technique RISC
architectures use is pipelining. The simple fixed instruction
formats make pipelining with RISC architectures very
efficient. RISC pipelines are also designed to reduce the
cycles lost to incorrectly predicted conditional branches.
One benefit of pipelining is that it provides a way to
start a new instruction before a previous one has been
completed. Figure 2.3 shows a sequential process being done
without the use of pipelining. To process the same task using
a five-stage pipeline as shown in Figure 2.4, five different
instructions may be processing at a time, and ideally, one
instruction is completed every cycle [Ibb90]. Pipelining
improves processor speed by increasing instruction throughput,
which reduces the average execution time per instruction.
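
The gain from overlapping instructions can be made concrete with a small sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and no hazards or stalls occur; the function names and the 100-instruction workload are illustrative assumptions.

```python
def sequential_cycles(n_instructions, n_stages):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipe; each later one finishes a cycle apart."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, k = 100, 5
seq = sequential_cycles(n, k)
pipe = pipelined_cycles(n, k)
print(f"sequential: {seq} cycles, pipelined: {pipe} cycles")
print(f"speedup: {seq / pipe:.2f}, pipelined CPI: {pipe / n:.2f}")
```

As the instruction count grows, the pipelined CPI in this ideal model approaches 1.0 and the speedup approaches the number of stages.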


Figure 2.3 A 5-Step Sequential Process.


Figure 2.4 Pipelined Execution of a 5-Stage Process.


The RISC I pipeline consisted of only two stages, a
fetch and an execute. The fetch stage, which brings the
instruction in from memory, took about the same time as the
execute stage, which actually performed the calculations and
wrote the results back to memory. The RISC II added a third
stage, the write stage, which wrote the results from a destination
register to memory at the appropriate time [GM87]. More
recent RISC architectures use four- or five-stage pipelines.
4. Computer Performance and The RISC Approach 
a. Measuring Performance 

Computer performance is measured by the amount of
time required to execute a program. Performance
encompasses two types of time, elapsed time and CPU time.
Elapsed time is the time required to execute a program from
start to finish. It includes the latency of input/output
activities such as memory and disk accesses, and it includes
overhead from the operating system, such as context switching
[HP90]. CPU time consists of user CPU time, which is the
actual time the computer spends in the user program, and
system CPU time, which is the time the computer spends in the
operating system doing some task required by the user program.
The number of clock cycles to execute an
instruction (cycles per instruction, CPI) and the number of
instructions a computer executes per second (millions of
instructions per second, MIPS) are also good indicators of
performance. CPI is calculated by knowing the number of clock
cycles and the instruction count:

$$\mathrm{CPI} = \frac{\text{Clock cycles for a program}}{\text{Instruction count}}$$

From this formula, clock cycles can be defined as CPI *
instruction count. MIPS, million instructions per second, can
be calculated as such:

$$\mathrm{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Clock rate}}{\mathrm{CPI} \times 10^{6}}$$

MIPS and CPI values can both be used to calculate program
execution time by:

$$\text{Program time} = \text{Instruction count} \times \mathrm{CPI} \times \frac{1}{\text{Clock rate}}$$

Observing the formulas above, improved performance, or reduced 
program execution time can be achieved by decreasing either 
the cycle time, the CPI, or the instruction count. 
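
The three relations can be checked against one another with a short worked example. The function names and the workload (10 million instructions, 15 million clock cycles, a 25 MHz clock) are hypothetical values chosen only for illustration.

```python
def cpi(clock_cycles, instruction_count):
    """Cycles per instruction."""
    return clock_cycles / instruction_count


def mips(clock_rate_hz, cpi_value):
    """Millions of instructions per second from clock rate and CPI."""
    return clock_rate_hz / (cpi_value * 1e6)


def program_time(instruction_count, cpi_value, clock_rate_hz):
    """Execution time in seconds."""
    return instruction_count * cpi_value / clock_rate_hz


# Hypothetical workload, not taken from the thesis.
instructions = 10_000_000
cycles = 15_000_000
clock_rate = 25e6  # 25 MHz

c = cpi(cycles, instructions)
t = program_time(instructions, c, clock_rate)
print(f"CPI = {c:.2f}, MIPS = {mips(clock_rate, c):.2f}, time = {t:.2f} s")
# Both forms of the MIPS formula should agree.
print(f"MIPS from execution time = {instructions / (t * 1e6):.2f}")
```

Halving the CPI, halving the instruction count, or doubling the clock rate each halves the program time in this model, which is the trade-off the paragraph above describes.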

b. The RISC Approach to High Performance 


RISC CPI values are typically between 1.5 and 2.0.
RISC architectures achieve this by defining simple
instructions and by using
sufficiently large cache memory systems that have low miss
rates. Simple instructions imply more efficient pipeline
operations. The low-miss-rate caches greatly influence RISC
performance, as during a miss the controller must first fetch
the instruction or data from main memory. This incurs a
significant increase in program execution time because
numerous cycles are required to access main memory [TT91].

RISC reduces its instruction count through the use 
of a large number of registers. Variables, constants, and 
addresses are placed in registers instead of time-consuming 
main memory. The use of registers instead of memory for 
instructions other than loads and stores also reduces the 
requirement for memory access which could result in a cache 
miss [AAD90]. 

The cycle time is dependent mainly on available
technology. The design of the cache and pipeline determines
whether or not an architecture can achieve the aim of one
instruction executed per cycle. RISC’s simple, fixed-length
instructions allow fast chip-to-cache interfacing. The fixed
formats also speed up decoding and dependency calculations,
which helps shorten the cycle time [Gar91].


B. SCALABLE PROCESSOR ARCHITECTURE (SPARC) 
The Scalable Processor Architecture (SPARC) is a Reduced
Instruction Set Computer (RISC) developed by Sun Microsystems
in 1987 [SPA88]. The SPARC architecture is based on the design
of the Berkeley RISC-II implementation [HP90]. The main
features of SPARC, like most other RISCs, include a small,
simple instruction set which directly enhances its
performance. SPARC is an open architecture with a published
design specification. This allows standard products to be
acquired from a more cost-effective vendor market as 
integrated circuits can be purchased from chip vendors, and 
software from software vendors. The primary objective of SPARC 
was to support the C programming language, numerical 
applications using FORTRAN, and artificial intelligence and 
expert system applications using Lisp and Prolog [RT88]. 
1. The SPARC Architecture 
The SPARC architecture consists of an integer unit
(IU) and a floating-point unit (FPU) configured around a 32-bit
virtual address bus and 32-bit instruction and data buses.
The storage system includes a memory management unit and a 
cache system for both instructions and data. Figure 2.5 shows 
the arrangement and interaction between components of the 
architecture. Some implementations of SPARC also include a 
coprocessor (CP). The IU, FPU, and CP each has its own set of 
registers. 
[Diagram labels: instruction and data bus, main memory, MMU, cache]

Figure 2.5 SPARC Architecture Components Diagram


a. The Integer Unit (IU)
The IU performs the basic processing for the SPARC
architecture. It executes the logical, arithmetic (except
floating-point), control transfer, memory reference, and
multiprocessor instructions (except floating-point
operations). It can have between 40 and 520 general-purpose
registers, depending on the implementation and register window
configuration. In addition to the window registers, the IU
includes the processor state register (PSR), the window
invalid mask (WIM), the trap base register (TBR), the program
counters (PC and NPC), and the multiply step register.
b. The Floating-point Unit (FPU)

The FPU performs floating-point operations
concurrently with the IU. It has 32 floating-point registers.
Double-precision numbers occupy an even-odd pair of registers,
and extended-precision values occupy four consecutive
registers. The FPU uses a queue to hold floating-point
instructions until they are ready to be executed. While
floating-point operations are executing, the IU also continues
to execute instructions. The FPU registers are accessible
only by special memory load and store instructions. These
instructions, called floating-point load/store instructions,
are not FPU operations, but IU operations. The IU generates
the address and the FPU recognizes and processes the
floating-point instructions [Gar91].

c. SPARC Registers and Register Windows 

The SPARC is characterized mainly by its
register-intensive design. The IU, FPU, and CP each have their
own set of registers, all of which are 32 bits wide. The use of
these registers reduces memory traffic, which significantly
speeds up program execution. SPARC further exploits the use of
registers through a register windowing scheme. The 40 to 520
registers available to the IU are made possible through the
partitioning of the register set into 2 to 32 overlapping
register windows [HP90]. The actual number of registers is
implementation dependent.
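
The endpoints of the 40 to 520 range follow from the window organization. The sketch assumes the usual SPARC arrangement in which each window contributes 16 registers of its own (8 locals and 8 outs, its ins being shared with a neighboring window) on top of 8 globals. The function and constant names are hypothetical.

```python
GLOBALS = 8
UNIQUE_PER_WINDOW = 16  # 8 locals + 8 outs; the ins alias a neighbor's outs


def iu_register_count(n_windows):
    """Total IU general-purpose registers for a given number of windows."""
    if not 2 <= n_windows <= 32:
        raise ValueError("this sketch assumes 2 to 32 windows")
    return GLOBALS + UNIQUE_PER_WINDOW * n_windows


for n in (2, 7, 8, 32):
    print(f"{n:2d} windows -> {iu_register_count(n)} registers")
```

Two windows give the minimum of 40 registers and 32 windows give the maximum of 520, matching the range quoted above.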

The primary purpose of the register windows is to
facilitate more efficient parameter passing during the
procedure calls of a program execution. During execution, a
program may access 32 general-purpose registers: 8 ins, 8
locals, and 8 outs belonging to each window, and 8 global
registers. Figure 2.6 shows a design of register windows. The
different windows are identified by the Current Window Pointer
(CWP), which decrements during a procedure call to activate the
next window and increments at procedure exit to activate the
previous window [Gar91].


Figure 2.6 SPARC Register Windows.


As shown in Figure 2.6, adjacent windows overlap by 8
registers. Registers R[8] to R[15] of a procedure caller’s
window become R[24] to R[31] after the call. R[16] through
R[23] are unique registers to each window. Global register
R[0] always contains the value 0, because it is the most
frequently used constant and should be easily available at all
times. The window registers are sometimes labeled I[0] to
I[7] for R[24] to R[31] for in registers, L[0] to
L[7] for R[16] to R[23] for local registers, O[0] to O[7] for
R[8] to R[15] for out registers, and G[0] to G[7] for the
global registers R[0] to R[7].
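
The overlap between a caller's outs and a callee's ins can be sketched as a mapping from a visible register number to a slot in a circular register file. The physical layout used here (globals first, then 16 slots per window) and the choice of 8 windows are assumptions made for illustration, not the layout of any particular SPARC implementation.

```python
N_WINDOWS = 8  # assumed; the actual number is implementation dependent


def physical_index(cwp, r, n_windows=N_WINDOWS):
    """Map visible register R[r] of window `cwp` to a physical slot."""
    if not 0 <= r <= 31:
        raise ValueError("r must be in 0..31")
    if r < 8:
        return r                                 # globals are shared by all windows
    size = 16 * n_windows
    if r < 24:
        offset = 16 * cwp + (r - 8)              # outs (8..15) and locals (16..23)
    else:
        offset = 16 * (cwp + 1) + (r - 24)       # ins alias the next window's outs
    return 8 + offset % size


def call(cwp, n_windows=N_WINDOWS):
    """A procedure call decrements the CWP (modulo the window count)."""
    return (cwp - 1) % n_windows


caller = 3
callee = call(caller)
for i in range(8):
    # The caller's R[8+i] and the callee's R[24+i] name the same slot.
    assert physical_index(caller, 8 + i) == physical_index(callee, 24 + i)
print("caller outs and callee ins share physical registers")
```

Because the shared slots need no copying, parameters written to the caller's out registers are already in place when the callee reads its in registers.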
Advantages of using register windows include
reductions in the number of load and store instructions and,
consequently, a decrease in the number of cache misses.
Register window operations are not without their drawbacks.
When all windows are full and a procedure call occurs, an
overflow occurs and the window trap handler must move 16
registers into memory. An underflow occurs when a procedure
return occurs and the windows are empty, causing the trap
handler to move 16 registers from memory. An overflow and
an underflow each cost about 60 cycles [HP90].
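
A rough feel for this cost can be had from a toy model that counts traps over a sequence of calls and returns. It is a simplification under stated assumptions: all windows are usable by the program, the trace is balanced, and each trap costs the 60 cycles quoted above. Details such as the window invalid mask are ignored, and the function name and event encoding are hypothetical.

```python
TRAP_COST = 60  # cycles per overflow or underflow, as quoted in the text


def window_trap_cycles(events, n_windows):
    """Cycles spent in window traps for a balanced trace of 'call'/'ret' events."""
    resident = 1   # windows currently held in the register file
    spilled = 0    # windows saved to memory by earlier overflows
    traps = 0
    for event in events:
        if event == "call":
            if resident == n_windows:
                traps += 1          # overflow: the oldest window goes to memory
                spilled += 1
            else:
                resident += 1
        elif event == "ret":
            if resident == 1 and spilled > 0:
                traps += 1          # underflow: a saved window is restored
                spilled -= 1
            else:
                resident -= 1
        else:
            raise ValueError(f"unknown event: {event}")
    return traps * TRAP_COST


deep = ["call"] * 10 + ["ret"] * 10
shallow = ["call"] * 5 + ["ret"] * 5
print(window_trap_cycles(deep, 8), window_trap_cycles(shallow, 8))
```

In this model, call chains that stay within the available windows incur no trap cycles, while each level of nesting beyond that costs one overflow on the way down and one underflow on the way back.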
2. The SPARC Instruction Set 
The SPARC instruction set consists of 55 basic integer
instructions and a set of floating-point instructions. All
instructions are 32 bits wide and are identified by one of
three different instruction formats. There are five basic
categories of SPARC instructions: (1) load and store
instructions, (2) arithmetic/logic/shift instructions, (3)
control-transfer instructions, (4) read/write control register
instructions, and (5) coprocessor operations [RT88].
a. SPARC Instruction Types 
(1) Load and Store Instructions. Load and store
instructions are also called memory reference instructions as
they are the only instructions that access memory. These
instructions use byte, halfword, word, and doubleword
operands. The load and store instructions can also be used to
access up to 256 different address spaces in the system by the
use of an address space identifier (asi). Figure 2.7 shows
two different load instructions and two different store
instructions.


[%g1+520], %g1
[%o6+94], %g1

%o7, [%o7+140]
%o5, [%o5+67]


Figure 2.7 Sample SPARC Load and Store Instructions.


The first instruction is a load single integer
instruction, which moves a word from memory into register %g1.
In this example, the memory location is denoted by the sum of
the contents of register %g1 and the constant 94. The second
instruction, the load doubleword, moves a doubleword from the
memory location indicated to %g1. The store instruction in the
example stores the value in %o7 into the memory address
indicated by the sum [%o6+140]. The last example, a store
halfword, moves the least significant halfword from %o5 to the
memory location specified by the sum of the contents of %o5 and
67.

(2) Arithmetic/Logic/Shift Instructions. The
arithmetic, logic, and shift instructions perform operations
on two operands and put the results into a destination
register. The operands can be either constants or register
contents. Figure 2.8 shows examples of each of the three types
of instructions.


add %l7, %g1, %l3
or %g0, 71, %o4
sll %o0, 2, %o2


Figure 2.8 Sample Arithmetic/Logic/Shift Instructions.


The add instruction adds the contents of
registers %l7 and %g1, placing the result in %l3. The or
instruction implements a bitwise logical OR operation on the
contents of %g0 and the constant 71, placing the result in
%o4. The shift instruction, sll, shifts the
contents of %o0 left by the number of bits indicated, 2, placing
the result in %o2.
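
The three sample instructions can be traced with a toy interpreter. This is a minimal sketch, not a SPARC simulator: the `Regs` class, the `execute` helper, and the initial register values are hypothetical, and only the behaviors described in the text (two source operands, one destination register, %g0 reading as zero) are modeled.

```python
MASK32 = 0xFFFFFFFF


class Regs(dict):
    """Register file in which %g0 always reads as zero."""

    def __getitem__(self, name):
        return 0 if name == "%g0" else self.get(name, 0)

    def __setitem__(self, name, value):
        if name != "%g0":                      # writes to %g0 are discarded
            super().__setitem__(name, value & MASK32)


OPS = {
    "add": lambda a, b: a + b,
    "or": lambda a, b: a | b,
    "sll": lambda a, b: a << (b & 31),         # shift count taken modulo 32
}


def execute(regs, op, rs1, src2, rd):
    """src2 is either a register name or an integer constant."""
    second = src2 if isinstance(src2, int) else regs[src2]
    regs[rd] = OPS[op](regs[rs1], second)


regs = Regs()
regs["%l7"], regs["%g1"], regs["%o0"] = 5, 7, 3   # assumed starting values
execute(regs, "add", "%l7", "%g1", "%l3")
execute(regs, "or", "%g0", 71, "%o4")
execute(regs, "sll", "%o0", 2, "%o2")
print(regs["%l3"], regs["%o4"], regs["%o2"])
```

Since %g0 is always zero, the `or` with a constant simply loads that constant into the destination register, which is a common idiom in place of a dedicated move instruction.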

(3) Control Transfer Instructions. Control transfer
instructions consist of conditional and unconditional branch,
jump, call, trap, and return from call instructions. These
instructions change the value of the program counter. Figure
2.9 shows examples of the types of control transfer
instructions.


Figure 2.9 Sample Control Transfer Instructions


The branch instruction, bne, evaluates a
condition code and the branch is taken if the condition is
true. In this example the target address is the PC value plus
4 (the address of the next instruction) times the value of
%g1. The jmpl instruction causes a control transfer to the
address indicated by the sum of %o7 and 8, placing the PC in
the destination register %g0. The call and rett instructions
direct a control transfer to the indicated memory address.

(4) Special Registers Read/Write Instructions.
These instructions read the contents or write new values to
the four special registers defined by the SPARC: Processor
State Register (PSR), Trap Base Register (TBR), Window Invalid
Mask (WIM), and the Y register, which is used for 64-bit integer
multiplication.

(5) Coprocessor Operations. These instructions 
perform floating-point calculations, as well as operations on 
floating-point registers. They also include instructions
involving the optional coprocessor.


b. SPARC Instruction Formats 

Figure 2.10 shows the three types of instruction
formats and the fields and bit positions for each format used
by the SPARC. The bit ordering in the formats is
little-endian¹ and the byte ordering is big-endian². SPARC
instructions have two basic addressing modes:
register+register and register+signed-immediate.
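
The two addressing modes can be sketched as a single address calculation. The sketch assumes the 13-bit immediate field and the i bit described for Format 3 in this section, with the immediate sign-extended before the addition; the function names and sample values are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def effective_address(rs1_value, i_bit, rs2_value=0, simm13=0):
    """register+register when i_bit is 0, register+signed-immediate when 1."""
    offset = sign_extend(simm13, 13) if i_bit else rs2_value
    return (rs1_value + offset) & 0xFFFFFFFF


# register+register: base 0x2000 plus index 0x10
print(hex(effective_address(0x2000, 0, rs2_value=0x10)))
# register+signed-immediate: base 0x1000 plus an immediate encoding -4
print(hex(effective_address(0x1000, 1, simm13=0x1FFC)))
```

A 13-bit signed immediate covers offsets from -4096 to 4095, which is enough for most stack-frame and structure-field accesses without a second register.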

Figure 2.10 SPARC Instruction Format. Courtesy of SUN
Microsystems [SPA88].


(1) Format 1 Instructions. Format 1 has a 30-bit
displacement field for Call, and in certain situations, Branch
instructions. A call may be made to a distant location in a
single instruction.

(2) Format 2 Instructions. Format 2 supports Sethi
(set high) and branch instructions. The Sethi instruction
loads a 22-bit immediate value into the upper 22 bits of the
destination register and clears its lower 10 bits. The 22-bit
displacement field also accommodates a ±8-Mbyte displacement
for conditional branch instructions.

(3) Format 3 Instructions. Format 3 is used for
the remaining SPARC instructions. It has fields for two
source registers and a destination register. When the i bit
is set (i=1), the 13-bit immediate field value is used instead
of the second source register. The load and store
instructions use the upper 8 bits of the immediate field as an
extension to the opcode fields to define floating-point
instructions. Unused values for opcodes are reserved for
future expansion and designated unimplemented.


¹ Little-endian machines store words with the
high-numbered bits as the most significant. For example, if the
binary number 1000 were represented in little-endian format, 1
is the high-order bit and the most significant bit, whereas
for big-endian representation, 1 would be the least
significant bit.

² Big-endian byte ordering stores the words with the
high-numbered byte as the least significant.
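
The Sethi behavior and the displacement ranges described for Formats 1 and 2 can be checked with a few lines of arithmetic. The sketch assumes that displacements count 4-byte words, which is consistent with the ±8-Mbyte figure for a 22-bit field; the function names and the sample constant are hypothetical.

```python
def sethi(imm22):
    """Place a 22-bit immediate in bits 31..10 and clear bits 9..0."""
    return (imm22 & 0x3FFFFF) << 10


def load_constant(value):
    """Build a 32-bit constant in two steps: Sethi, then OR in the low 10 bits."""
    high = sethi(value >> 10)
    return high | (value & 0x3FF)


def displacement_range_bytes(bits):
    """Byte range reachable by a signed word displacement of the given width."""
    words = 1 << (bits - 1)
    return -4 * words, 4 * (words - 1)


print(hex(sethi(0x3FFFFF)))
print(hex(load_constant(0xDEADBEEF)))
print(displacement_range_bytes(22))   # conditional branch, Format 2
print(displacement_range_bytes(30))   # call, Format 1
```

The 22-bit field reaches about 8 Mbytes in either direction, while the 30-bit field spans the whole 32-bit address space, which is why a call can reach a distant location in a single instruction.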
3. SPARC Pipelines
The SPARC SF9010IU and CYC601 processors use a
four-stage pipeline: fetch, decode, execute, and write. Each
stage performs a subset of operations needed to complete the
execution of an instruction as depicted in Figure 2.11. Each
stage completes its operation in a given cycle.


Fetch Stage --> Decode Stage --> Execute Stage --> Write Stage


Figure 2.11 SPARC Pipelined Execution of Instructions.


At the fetch stage, the address of the instruction is
sent out and the instruction is brought into the pipeline.
During the decode stage the source operands are read from the
registers and passed to both the execution unit and the
instruction unit for later processing. Also, at this stage
the address of the next instruction is calculated. In the
execute stage, arithmetic and logic operations are performed.
The results of these calculations are stored in temporary
registers before they are written into the appropriate
destination registers. The write stage of the pipeline writes
the results in the register file, and the instruction is now
finished executing [NA91].

The four-stage pipeline is illustrated in Figure 2.12.
Although it takes four cycles from start to finish of each
individual instruction, after the initial instruction
completes, an instruction is completed every cycle afterwards
(ignoring pipeline hazards). Also notice that when the first
instruction, I(1), is in the final stage of the pipe,
instructions I(2), I(3), and I(4) have already entered the
pipe and are being processed.
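
The stage occupancy described above can be reproduced with a short sketch that prints which stage each instruction occupies in each cycle. It assumes an ideal pipeline with no hazards and one cycle per stage; the function name and table layout are hypothetical.

```python
STAGES = ("FET", "DEC", "EXE", "WRT")


def pipeline_table(n_instructions, stages=STAGES):
    """Rows are instructions, columns are cycles; '---' marks an idle slot."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = ["---"] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name        # instruction i enters stage s in cycle i + s
        rows.append(row)
    return rows


table = pipeline_table(4)
print("cycle " + " ".join(f"{c:>3d}" for c in range(1, len(table[0]) + 1)))
for i, row in enumerate(table, start=1):
    print(f"I({i})  " + " ".join(row))
```

Reading down the fourth cycle column shows I(1) in the write stage while I(2), I(3), and I(4) occupy the execute, decode, and fetch stages respectively.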


Figure 2.12 A SPARC Four-Stage Instruction Pipeline: Fetch
(FET), Decode (DEC), Execute (EXE), Write (WRT).


The SPARC B5000 uses a five-stage pipeline: fetch,
decode, execute, memory, and write. The memory stage is
located between the execute and write stages of the previous
pipeline example. The memory stage is used for those
instructions that have memory references. This stage performs
the data transfers after the execute stage generates the
memory address. The write stage places the result data from
the memory stage into the register file [ABMP91].


C. THE LOCKUP-FREE CACHE INTERFACE
1. General 

Although RISC has proven to be a high performance
architecture, situations such as data dependencies between
instructions, conditional branch instructions, and memory
access penalties prevent RISC from achieving the goal of one
instruction per cycle. The high performance of RISC
architectures is partially attributed to their use of high
speed cache systems. One important performance criterion of a
cache is to maximize the probability that the requested data
is present, that is, to attain a maximum hit ratio. Another
criterion is to ensure that data access time from the cache is
minimal. Thus, cache design is a major issue in computer
performance. The parameters that are targeted in designing
more efficient caches include cache size, cache associativity,
cache replacement policy, line size, and hardware prefetching
[Por89]. Most cache memories have hit rates between 85% and
95%, and cache memory access times are 5 to 10 times faster
than main memory access times [LFK90].


we) 


Regardless of the hit rate of a cache memory system, 
a miss or a write instruction will require main memory access.
Main memory access is a major factor in performance 
degradation of a computer system. Generally, there are two 
ways to reduce memory access penalties: (1) reducing the 
number of memory requests, and (2) reducing the average 
latency [Por89]. RISC approaches the first problem by 
efficient register allocation. The second problem must be 
solved by acquiring more memory bandwidth. 

To further reduce the adverse effect of a non-cache
access on a RISC architecture, a cache-to-main memory
subsystem, called a lockup-free cache interface, is proposed as
a possible solution. Such a scheme was used by Kroft for a
uniprocessor architecture [Kro81], and by Scheurich and Dubois
for a multiprocessor architecture [SD91]. As the name
implies, a lockup-free cache interface prevents non-cache data
requests from "locking up" the processor. The processor is
allowed to continue processing instructions while memory
requests are being handled. Figure 2.13 illustrates a memory
hierarchy that includes a lockup-free cache interface.


2. The Lockup-Free Cache Interface Concept 
A lockup-free cache interface is a component of
cache-based memory systems used to control access to main
memory. The objective of the lockup-free cache interface is to
increase the effectiveness of cache-based memory systems by
minimizing the penalty for main memory accesses [Kro81]. The
basic concept is to prevent the processor from freezing on
non-cache accesses. On RISC machines, main memory accesses
are required for cache misses and for write instructions.

Figure 2.13 Memory Hierarchy with Lockup-Free Cache
Interface.
3. Design Issues of Lockup-free Cache Interfaces 
a. Memory Request Queue 

A major design consideration of a lockup-free cache 
interface is the use of a waiting queue for main memory 
requests. During processing, when a cache miss or a write 
instruction is encountered, the request is placed in a queue 
for main memory requests. At the same time that memory 
requests are being served from the queue, the processor 
continues to issue new instructions until the memory request 
queue fills or an instruction is dependent on data in the
memory request queue. If an issued instruction is dependent
on data in the memory queue, then the instruction is blocked
or put on hold until the required data is available.
b. Other Design Issues 

The effects of a lockup-free cache interface on 
RISC performance also depend on other important design issues. 
One design issue is whether to use a shared or separate memory 
request queue for misses and writes. Another issue is whether 
to use a queue for blocked instructions or to freeze the
processor when an instruction is dependent on a queued data
request. The length of the queue for main memory requests is
also a design issue that may determine the effects of the
lockup-free interface.

III. A LOCKUP-FREE CACHE INTERFACE MODEL


A. THE LOCKUP-FREE CACHE INTERFACE DESIGN 
1. General 

In presenting a model for a lockup-free cache 
interface, we do not attempt to define a specific cache memory 
design. We also assume that there are separate caches for
instructions and data, which is the case in some SPARC
implementations. Thus, the effect of instruction misses is not
considered, as it is assumed to be insignificant. With the
high hit rates of most cache systems, most main memory 
accesses are likely to be writes instead of instruction 
misses. A write-through policy is also assumed. 

2. The Major Components of the Cache Interface 

Figure 3.1 is an overview of the components needed to 
implement the lockup-free cache interface. The interface has 
two queues: Memory Access Queue (MAQ), and Blocked Instruction 
Queue (BIQ). The MAQ is used for storing read misses and 
writes. It may either be a FIFO or priority queue, or it may 
be a split queue configuration with reads and writes in 
separate MAQs. 

The BIQ holds target register numbers of the read
instructions that are in the MAQ. The BIQ entries correspond
to the read entries in the MAQ. Therefore, the BIQ may be a
FIFO or a priority queue, depending on the MAQ. Figure 3.2
illustrates typical entries in the MAQ. The main memory
controller uses a Memory Ready (MR) signal and a Data Ready
(DR) signal to communicate with the cache interface.

Figure 3.1 Structure of the Lockup-Free Cache Interface


Figure 3.2 MAQ Formats for Reads and Writes





B. OVERVIEW OF THE SYSTEM OPERATION 
1. Lockup-Free Cache Operation 
The cache operates like a typical cache as long as no 
read misses or writes are encountered. On a read miss, an 
entry is added to the MAQ and the destination register of the 
instruction is entered into the BIQ. On a write instruction, 
an entry is enqueued in the MAQ. 
The main memory controller sends a Memory Ready (MR)
signal to the cache interface indicating that another memory
access can be initiated. When the MR signal is received by
the cache interface, the next entry in the MAQ is dequeued and
sent to the memory controller. The memory controller also
sends a Data Ready (DR) signal to the interface indicating
that the data accessed from main memory is ready to be loaded.
An entry in the BIQ is then dequeued and the data is loaded
into the register.
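
The queue handling described above can be sketched in a few
lines of Python. This is only an illustrative model under
stated assumptions, not the interface hardware or the
simulator code: the class name LockupFreeInterface, its method
names, and the tuple layout of the MAQ entries are
hypothetical, and both queues are taken to be FIFO.

```python
from collections import deque

READ, WRITE = 0, 1  # request kinds (labels chosen for this sketch)


class LockupFreeInterface:
    """Toy model of the MAQ/BIQ pair; both queues are FIFO here."""

    def __init__(self):
        self.maq = deque()   # pending main memory requests
        self.biq = deque()   # target registers of pending read misses
        self.registers = {}  # register number -> loaded value

    def read_miss(self, address, target_reg):
        # A read miss adds a MAQ entry and records its target register.
        self.maq.append((READ, address))
        self.biq.append(target_reg)

    def write(self, address, value):
        # A write only needs a MAQ entry.
        self.maq.append((WRITE, address, value))

    def memory_ready(self):
        # MR signal: hand the next MAQ entry to the memory controller.
        return self.maq.popleft() if self.maq else None

    def data_ready(self, value):
        # DR signal: dequeue a BIQ entry and load the data into it.
        reg = self.biq.popleft()
        self.registers[reg] = value
        return reg
```

For example, after read_miss(0x26c78, 8) and a later write, the
first memory_ready() call returns the read request, and
data_ready(42) loads 42 into register 8.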
2. The Processor Operation Model 

The processor stalls on three conditions: (1) the
instruction to be issued uses a register that is being used as
a target by an instruction in the MAQ or main memory, (2) the
instruction to be issued uses a register that is the target of
a blocked load instruction, and (3) the MAQ fills up. When
any of these situations occurs, the processor stalls until the
DR signal is received and the BIQ dequeues the appropriate
register. The MAQ will continue to process requests until the
target register causing the stall receives the required data.

Using the lockup-free cache, the processor is assumed
to be able to issue instructions before the previous ones
complete. Thus, the instructions can complete out of order.
This is generally the case with writes and read misses that
must wait to be served by main memory. While these
instructions are waiting for main memory service, the
processor continues to fetch and execute instructions. The
processor cannot issue an instruction that depends on a
previous instruction that is currently blocked.
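
The three stall conditions can be expressed as a single
predicate. The following is a minimal sketch, assuming that
conditions (1) and (2) are tracked together as one set of
registers still waiting for data; the function name and its
parameters are hypothetical and do not come from the model
itself.

```python
def must_stall(regs_used, pending_targets, maq_length, maq_capacity):
    """Return True if the next instruction cannot be issued.

    regs_used       -- registers the instruction to be issued uses
    pending_targets -- registers awaiting data from the MAQ, main
                       memory, or a blocked load (conditions 1 and 2)
    maq_length      -- number of entries currently in the MAQ
    maq_capacity    -- maximum number of MAQ entries (condition 3)
    """
    uses_pending = any(reg in pending_targets for reg in regs_used)
    maq_full = maq_length >= maq_capacity
    return uses_pending or maq_full
```

With register 8 pending, an instruction that uses registers 8
and 9 stalls, while one that uses registers 1 and 2 issues as
long as the MAQ still has room.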
IV. SIMULATION TOOLS


To provide the capabilities for an analysis and view of 
the lockup-free cache and the RISC architecture, three 
different simulation tools are used in this study. These 
tools allow us to observe the behaviors of several variations 
of the architecture. Additionally, these tools produce 
address traces of actual program executions. This allows for 
more accurate and realistic results from the modelled RISC 
architectures. Figure 4.1 illustrates the simulation 
environment for this research. 

The first simulator is the SPARC Performance Analyzer 
(SPA) 1.0. The SPA is used to produce address traces of 
programs and to provide instruction count data for those 
traces. A second simulator is an address trace translator 
which produces readable instruction records and modified 
address trace files for use in other simulation tools. The 
instruction records provide the user with information such as 
instruction and data addresses, binary representations, 
opcodes, and registers used. The third simulator is a
lockup-free cache interface simulator, which simulates a
cache-to-main memory subsystem used to reduce the cost of main
memory access.

Figure 4.1 Simulation Environment

A. THE SPARC PERFORMANCE ANALYZER 1.0 SIMULATOR
The SPARC Performance Analyzer (SPA) 1.0 is a package of
simulation tools used to analyze the performance of programs
executed on SPARC machines. SPA can simulate two different
SPARC implementations: the Cypress CY7C601 and the Fujitsu
MB86901. Simulations can be run on SPARCstations or any
machine using the SunOS 4 operating system. The SPA was
developed by Gordon Irlam and made available to users via file
transfer protocol (ftp). The specific version we used was
ported from ftp.uu.net:/system/sun/spa-1.0.tar.Z.

The SPA 1.0 consists of three major components: SPY, 
SPANNER, and SPOUT. The SPY is a tool that traces the 
execution of a program and produces an address trace file; the 
SPANNER is a tool that converts the address traces into 
instruction count files; and the SPOUT is a component that 
formats and displays the results of the instruction count. 

There are numerous other tools in the SPA package that
support the three major components. These tools add to the
flexibility of SPA by allowing the user to set various
parameters of the architecture and determine the effects on
performance. Appendix A provides additional information on
the major components of SPA and their uses.


B. THE SPARC ADDRESS TRACE TRANSLATOR/ANALYZER 
1. General 
The SPARC Address Trace Translator/Analyzer (SATTA) is
a program that takes as input the address trace files
generated by SPA and translates them into detailed readable
instruction records. The SATTA also generates SPARC assembly
language files and specially modified address traces to be
used in the lockup-free cache interface simulator.
2. Instruction Address Trace Format 

The composition of each instruction record in the
address trace file is illustrated in Figure 4.2. The
execution trapped (et) field is a one-character field that
indicates the execution status of the instruction. A 0 means
the instruction was executed, and a 1 indicates that it was not
executed. The data address valid (dav) field is a
one-character field, 0 or 1, indicating whether or not the data
address field is valid. The data address field is valid if
the instruction is a load or store instruction.


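struct instruction {
    char et;            /* execution trapped flag */
    char dav;           /* data address valid flag */
    short tn;           /* trap number */
    unsigned long op;   /* instruction word */
    unsigned long ia;   /* instruction address */
    unsigned long da;   /* data address */
};

Figure 4.2 Trace Instruction Format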


The op field contains the integer value of the actual
SPARC instruction. The instruction address (ia) field is the
address in memory where the instruction was referenced or
fetched, and the data address (da) field indicates the memory
location of the referenced or target data.
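
A record with this layout can be read with Python's struct
module. The sketch below rests on assumptions that the text
does not state: the file is big-endian (as on SPARC), unsigned
long is 32 bits, the fields are packed in the order shown in
Figure 4.2 with no padding, and et and dav hold the numeric
values 0 or 1. The names TraceRecord, RECORD, and
unpack_record are hypothetical.

```python
import struct
from collections import namedtuple

# Field names follow Figure 4.2.
TraceRecord = namedtuple("TraceRecord", "et dav tn op ia da")

# ">" = big-endian, no padding: 2 chars, 1 short, 3 32-bit words.
RECORD = struct.Struct(">bbhIII")


def unpack_record(raw):
    """Convert one 16-byte trace record into a TraceRecord."""
    return TraceRecord(*RECORD.unpack(raw))
```

Under these assumptions each record occupies RECORD.size = 16
bytes, so a trace file can be processed by reading it 16 bytes
at a time.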
3. The SATTA Instruction Record
The SATTA program translates each address trace record
into a very detailed, more readable display of information.
A sample record is shown in Figure 4.3. This record includes
all the information generated by the SPY component. In this
example we see that the instruction was executed and that the
value indicated in the data address field is valid.


record:                 1
exec status:            0
valid addr:             1
trap no.:               0
instruction:            -805166984
binary representation:  11010000000000100010000001111000
op field:               3
opcode value:           0
opcode:                 ld
rd value:               8
rs1 value:              8
index bit:              1
simm13:                 120
inst addr:              26b74
data addr:              26c78

Figure 4.3 Detailed Expansion of an Instruction Record


The value in the instruction field is the SPARC
instruction in integer form. This integer value is not very
meaningful to the user as displayed. However, the binary
representation field provides a more visual means for
determining the components of the instruction.

The 32-bit binary representation field is matched to
the SPARC instruction format templates to determine the type
of instruction, the opcode, registers used, and displacement
values. The op field value is the operation type. The
operation type also determines the instruction format type.
The opcode value field is the integer value of the opcode
within the operation type. This value is translated into the
actual opcode. The rd value is the destination register; rs1
value is the source register; index bit indicates whether or
not index addressing is used; and simm13 is a signed integer
value used in calculating an immediate address. The
instruction and data address values are translated directly
from the trace file. Other instruction formats may contain
different fields such as rs2 for a second source register, or
an immediate address.
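
The field extraction for the record in Figure 4.3 can be
reproduced with a few shifts and masks. This is a sketch of
the SPARC format-3 layout (op, rd, op3, rs1, i, then either
simm13 or rs2), not the SATTA source; the function name
decode_format3 is hypothetical, and the key "op3" corresponds
to the "opcode value" field and "i" to the "index bit" field
of the figure. In this format, i = 1 selects the simm13
immediate and i = 0 selects rs2.

```python
def decode_format3(word):
    """Split a 32-bit SPARC format-3 instruction word into its fields."""
    word &= 0xFFFFFFFF                  # accept the signed integer form
    fields = {
        "op":  (word >> 30) & 0x3,      # operation type
        "rd":  (word >> 25) & 0x1F,     # destination register
        "op3": (word >> 19) & 0x3F,     # opcode within the type
        "rs1": (word >> 14) & 0x1F,     # first source register
        "i":   (word >> 13) & 0x1,      # 1 selects the immediate form
    }
    if fields["i"]:
        simm13 = word & 0x1FFF
        if simm13 & 0x1000:             # sign-extend the 13-bit value
            simm13 -= 0x2000
        fields["simm13"] = simm13
    else:
        fields["rs2"] = word & 0x1F     # second source register
    return fields
```

Calling decode_format3(-805166984) yields op = 3, rd = 8,
op3 = 0, rs1 = 8, i = 1, and simm13 = 120, which agrees with
the values listed in Figure 4.3.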
4. SATTA File Generators
In addition to producing detailed instruction record
translations, the SATTA program also generates various types
of files. These files may be used for additional tracing,
further analysis, or as input files to other programs and
simulators.
a. SPARC Assembly Language Files 
One type of file generated by the SATTA is an
assembly language program. The file is produced from
information taken directly from the translated instruction
record. Figure 4.4 is an excerpt from an assembly language
file generated by SATTA. This capability allows the user to
see the assembly language equivalent of the traced program or
to manually trace parts of the program.


S35 nop 

84; bvs 8 

55° Ore Sle? 7626, 54 7 
S6: sethi 8, %00 

BR or SOWR 39, 500 
co. earl. a5 

Bi: or Si poo Ll 
90: ba 77200 

91: Oe 6$g0,5,%g1 
ae ta $g0,0,%00 


Figure 4.4 Assembly Language Code Produced by SATTA 


The assembly language file produced by SATTA can
also be used as input to other simulators to demonstrate other
features of RISC. Of particular use, the assembly language
file can be used in simulating other RISC architecture
components, such as pipelines, caches, or proposed add-ons.
The instructions are in the standard SPARC assembly language
syntax. The file is stored in ASCII format, and thus can be
easily used by other programs on almost any type of machine.

b. Cache Address Trace Files 

Cache address trace files are specially tailored
files for use by the lockup-free cache interface simulator.
The files consist of only that information from instruction
records required to sufficiently simulate the cache interface.
The use of only pertinent information speeds up the
simulation. All data included in the files is in hexadecimal
format. Figure 4.5 shows records from a cache trace file.
Although the cache address trace files were created
specifically for the lockup-free cache interface simulator,
they can also be used as address trace input files for other
cache simulators.


Code   Address

2      0000213c
0      ETELEIEO
2      00002140
2      0O000000b
3      00002148
2      COUOZ1T 70
0      EE? Leelee
2      00002178
1      F7FLLId4
2      0000217c

Figure 4.5 Records from Cache Address Trace File


Each instruction generates a cache interface
record. Each record is assigned a code of 2, except for
branch instructions, which have a code of 3. The instruction
address and the source and destination registers (Rs1, Rs2,
and Rd) are also part of the cache address trace records.

All load and store instructions generate an
additional cache interface record. The records generated by
loads are given a code of 0. The memory address of the needed
data, along with the target register for the load operation, is
included in the additional load record. Similarly, the
additional records generated by the store instructions contain
the memory address of where the data is to be stored, and the
register number containing the data. The code for the store
instructions is 1.

The different code types are used by the lockup-free
cache interface to determine the number of cycles
required to execute each type of instruction. Branch, load,
and store instructions all generally require more than one
cycle to execute. The cache interface simulator sets the
simulated number of cycles required for these instructions.
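
The record codes described above can be summarized in a short
sketch. It is a simplification under stated assumptions: the
function name and the (code, address, register) tuple layout
are hypothetical, the Rs1, Rs2, and Rd fields of the
instruction record are omitted, and the instruction record is
assumed to precede the additional load or store record.

```python
LOAD, STORE, OTHER, BRANCH = 0, 1, 2, 3   # record codes from the text


def cache_trace_records(kind, inst_addr, data_addr=None, reg=None):
    """Return the (code, address, register) records for one instruction.

    kind is one of "load", "store", "branch", or "other".
    """
    code = BRANCH if kind == "branch" else OTHER
    records = [(code, inst_addr, None)]           # one per instruction
    if kind == "load":
        records.append((LOAD, data_addr, reg))    # reg receives the data
    elif kind == "store":
        records.append((STORE, data_addr, reg))   # reg holds the data
    return records
```

A load therefore produces two records (codes 2 and 0), a store
produces two records (codes 2 and 1), and a branch produces a
single record with code 3.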


C. THE RISC CACHE INTERFACE SIMULATOR 
1. General 
The RISC Cache Interface Simulator (RICIS) is a 
simulation tool that models a program executing on a RISC 
machine using a lockup-free cache interface. The primary 
objective of the RICIS is to calculate the performance of the 
RISC using the interface. The simulator is event-driven and 
uses the modified address trace files produced by SATTA as the 
input program. The results from the RICIS simulation are
compared to the results from running the same program with the
SPA simulator to determine the effects of the design.
2. The RICIS Program 
The RICIS is designed to simulate several different
configurations of a lockup-free cache interface. It can be
easily modified to simulate even more design alternatives and
to perform various statistical functions. RICIS can simulate
large program executions since the traces have been modified
to consist of only a few characters of information per
instruction, and the trace instructions are discarded after
they are processed. Therefore, although the traces generated
by SPA and SATTA consume a considerable amount of disk space,
RICIS can run most simulations without requiring additional
disk space.
3. The RICIS Operation 
a. Assumptions and Constraints 

The assumptions and constraints of the RICIS are as
follows:

(1) Floating point instructions. RICIS does not
simulate floating point instructions. Floating point
instructions are handled the same as integer instructions and
are assumed to execute in a single cycle. Although this
differs significantly from reality, this constraint is
consistent with the SPA constraint. Therefore, comparing
results produced by the two simulators using floating-point
instructions should not present a problem.

(2) Simulating cache hits and misses. There is
currently no cache simulator available to determine if a load
instruction is a cache hit or miss, thus the determination of
a load hit or miss is simulated using a random number
generator. The user determines the hit ratio to be simulated.
Once a load instruction is encountered, the random number
generator produces a number between 0.0 and 100.0. If the
generated number is greater than the hit ratio entered by the
user, the load is considered a cache miss, otherwise a hit.
We realize that cache hits and misses are not random, but this
feature should at least produce the same percentage of hits.
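
The hit or miss decision described in (2) amounts to a single
comparison. The sketch below is an illustration rather than
the RICIS code; the function name is_cache_miss and the
optional rng argument (added here so that a seeded generator
can make runs repeatable) are assumptions.

```python
import random


def is_cache_miss(hit_ratio, rng=random):
    """Decide hit or miss for one load; hit_ratio is a percentage."""
    # A draw above the hit ratio counts as a miss, otherwise a hit.
    return rng.uniform(0.0, 100.0) > hit_ratio
```

With a hit ratio of 90.0, roughly one load in ten is treated
as a miss over a long trace, although the misses are spread
uniformly rather than clustered as they would be in a real
cache.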

(3) Instruction types. The input to the RICIS is
the modified address trace produced by SATTA. The RICIS does
not need to distinguish between instruction opcodes. Thus,
all instructions of the address trace are categorized into four
different types: loads, stores, branches, and others. All
instructions of each instruction type are assumed to execute
in the same amount of time. Basically, the RICIS needs to
know whether an instruction is a memory instruction or if it
requires more than one cycle to execute.

b. Setting Simulation Parameters 

To run the RICIS program, the user enters the 
command RICIS. The program then prompts the user to enter the 
name of the address trace file and to set the parameters of 
the lockup-free cache to be simulated. In setting the 
parameters for the simulation, RICIS offers a variety of 
design options for simulating a lockup-free cache interface. 
One parameter choice is the simulated cache hit ratio. The
user may enter a percentage value from 0.0 to 100.0.


After entering the cache hit ratio, the user must
specify the MAQ configuration. The choices are FIFO and
priority. If a priority queue is chosen, the user has a choice
between simulating a single queue for writes and read misses,
or a separate queue for each type of MAQ entry. The user
then sets the length of the MAQ.

Another parameter the user must set is the main
memory access penalty (in number of cycles) for cache misses
and store instructions. The user may also set as a parameter
the number of cycles to delay for branch instructions and for
load dependency situations. A load dependency situation
occurs when an instruction immediately following a load
instruction requires the loaded results.

The final response the user must enter is whether 
or not to view a cycle-by-cycle execution of the simulation. 
If the user does not wish to view the simulation, performance 
results are provided at the end of the simulation run. The 
view capability lets the user observe the behaviors of the 
target architecture under varied workloads. With address
traces containing hundreds of thousands of instruction
records, the user may choose to view partial executions. This
is accomplished by using the option of viewing the results in
intervals. The user may elect to view interim results every
100, 700, 10,000, etc., instructions. At each interval, the
option of terminating the simulation is offered.

c. RICIS features

(1) The Priority Event Queue (PEQ). The PEQ is a
priority queue that stores the events that drive the program
execution. The PEQ basically simulates the processor and the
memory controller. There are two events that are required to
run the simulation: issue instruction (ii) and leave memory
(lm). The ii event directs the simulator to issue another
instruction from the address trace file. The lm event directs
the simulator to remove the next request from the memory
queue. Figure 4.6 shows a PEQ with events entered.

The time entry is the cycle number in which the
event can occur. The time is also the priority in determining
which event is to occur next. In the example, the next item
(event) to be served from the PEQ is an instruction issue,
occurring at cycle 22 of program execution. If this were a
FIFO queue, the next item to be served would be the lm at
cycle 29 of execution.
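
An event queue ordered by cycle number can be sketched with
Python's heapq module. This is an illustration of the idea,
not the RICIS data structure; the class and method names are
hypothetical, and the insertion counter used to break ties
between events scheduled for the same cycle is an added
assumption.

```python
import heapq


class PriorityEventQueue:
    """Events are served in order of their cycle number."""

    def __init__(self):
        self._heap = []
        self._count = 0   # tie-breaker: insertion order for equal times

    def schedule(self, time, event):
        heapq.heappush(self._heap, (time, self._count, event))
        self._count += 1

    def next_event(self):
        time, _, event = heapq.heappop(self._heap)
        return time, event
```

If an lm event is scheduled for cycle 29 and an ii event is
then scheduled for cycle 22, next_event() returns the ii event
first, as in the example above; a FIFO queue would have
returned the lm event.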





Figure 4.6 View of Priority Event Queue

(2) Simulating the Different Memory Queue Schemes.
The RICIS allows the user the option of simulating either a
FIFO or a priority MAQ. It also offers the option of
storing the read misses and writes in the same queue or using
separate queues. Figure 4.7 shows the contents of a simulated
FIFO MAQ with memory requests served on a first-come,
first-served basis. Figure 4.8 illustrates the combined
priority MAQ simulation with memory requests served in order
of precedence according to the given priority value. Figure
4.9 shows the contents of a separate read and write MAQ
simulation. Using this configuration, requests are served
based on the priority assigned to reads and writes. Requests
are serviced FIFO within their respective queues. The code
entry is a 0 for a read entry and a 1 for a write. The
address is the location in memory where the data is read from
or written to.

The priority value for determining the 
precedence of a read and a write is set by the user, or it may 
be entered into the actual code as a constant. The priority 
value determines the next request to be served from the MAQ. 
All reads will have the same priority, as will all writes. 
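
The separate-queue scheme can be sketched as two FIFO queues
with a priority attached to each. The class name SeparateMAQ,
its methods, and the convention that a lower number is served
first are assumptions made for this illustration; the text
only states that one priority applies to all reads and one to
all writes.

```python
from collections import deque


class SeparateMAQ:
    """Separate read and write queues, FIFO within each queue."""

    def __init__(self, read_priority=0, write_priority=1):
        # Lower number = served first (a convention of this sketch).
        self.queues = {"read": deque(), "write": deque()}
        self.priority = {"read": read_priority, "write": write_priority}

    def enqueue(self, kind, address):
        self.queues[kind].append(address)

    def dequeue(self):
        # Serve the non-empty queue with the best priority.
        for kind in sorted(self.queues, key=self.priority.get):
            if self.queues[kind]:
                return kind, self.queues[kind].popleft()
        return None
```

With the default priorities, a read miss enqueued after two
writes is still served before them, and the writes then leave
in their original order.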

(3) Simulating Blocked Instructions. To simulate
the blocked registers that are awaiting main memory access, we
use an array consisting of boolean values for each of the
registers. When a read miss occurs, the target register of
the read instruction is marked as blocked (the array element
corresponding to the blocked register is set to true),
simulating the register waiting for the data ready signal from
main memory. The register is unblocked (array element set to
false) when the simulation indicates that the instruction has
completed its main memory access. The blocked register array
prevents other instructions from using the blocked registers.
If a blocked register is referenced by another instruction, a
stall is simulated until the register is unblocked.

Figure 4.7 View of Simulated FIFO MAQ

Figure 4.8 View of Simulated Priority MAQ

Read MAQ
CODE   ADDRESS
0      E7/£LFEEaSO
0      £7 ££Ea00

Write MAQ
CODE   ADDRESS
1      0000a130
1      0000a220
1      0000a232

Figure 4.9 View of Simulated Separate MAQ Scheme
d. Calculating Performance Results 

The CPI is calculated by dividing the number of
cycles accumulated by the number of instructions issued. The
cycle count includes memory access penalties and stall cycles
for load dependencies on blocked registers. Interim results may
also be obtained from the simulator, and additional
statistical data can be obtained with minimal modifications.

V. SIMULATIONS AND RESULTS OF LOCKUP-FREE CACHE INTERFACE


A. METHODOLOGY 
1. General 

In this section we conduct performance evaluations of
the lockup-free cache interface using the RICIS program. The
simulations show the effectiveness of various design
alternatives of the interface. Three different programs were
executed to present various workloads. In addition to
determining the performance (CPI value) using the interface,
a cycle-by-cycle visual trace can be generated by the user to
observe the behavior of the system.

2. Structures to be Evaluated

Evaluations are provided based on the following
parameters: size of MAQ, type of MAQ (i.e., FIFO,
single-queue-miss-priority, and separate-queue-miss-priority
MAQ for loads and stores), and combinations thereof. The
single-queue-miss-priority MAQ stores both read misses and
writes in the same queue with read misses having a higher
priority. The separate-queue-miss-priority MAQ stores read
misses in one queue and writes in a separate queue with
priority of service given to the read miss queue. For this
type of MAQ simulation, the combined size of the two queues is
used as the queue size. Numerous possible configurations of
the lockup-free cache can be simulated. Due to time
constraints, however, we can only simulate a few designs. In
determining the effects of a particular interface parameter,
for each base configuration simulated, a single parameter is
varied at a time.


3. Fixed Parameters 

The following parameters are fixed for the simulation
experiments: (1) cache hit ratio is 0.9, (2) load dependency
delay is one cycle, (3) experiments are done using two
different memory access delay values, the first one is 50
cycles, and the second is 5 cycles, (4) branch delay is 3
cycles, and (5) whenever a priority scheme is used, read misses
have a higher priority. The read-over-write priority is chosen
because with a high hit rate, most main memory accesses will
be writes, so reads would otherwise have to wait until all the
preceding writes are dequeued, which would further increase the
chance of a load dependency stall.

The base configuration used by SPA to compare the
results with those of the RICIS is the SPARC CY7C601 processor
with the SS2 cache memory. Appendices E, F, and G contain
the SPA-generated statistical analysis reports of the three
test programs. To determine the effects of a lockup-free
cache interface on a RISC processor, we calculate the CPI of
the test programs run on the SPA. Since we are using memory
access penalty and cache hit ratio as parameters for the
lockup-free cache, we ensure that the same parameters are used
with the instruction count data from SPA to determine the CPI
of the programs without using the lockup-free cache.

In calculating the CPI of the programs run on SPA, we must
first know the percentage of the total instruction count that
each instruction type (i.e., ALU, Branch, Load, and Store)
makes up. This information is obtained from the SPA report.
We then use the following formula:

CPI = %ALU*a + %Branch*a*b + %Store*(m+a) + %Load*a*r
      + %Load*(m+a)*(1-r)

where a is the number of cycles required to execute the
instruction, b is the number of cycles for a branch delay, m
is the memory access penalty, and r is the hit ratio. For our
experiments, we use: a=1, b=3, r=0.9, m=50 for the first set
of experiments, and m=5 for the second.
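
The baseline CPI calculation can be written as a small
function. This is a sketch of the formula above, with the
instruction percentages expressed as fractions between 0 and
1; the function name baseline_cpi is hypothetical, and the
instruction mix used in the example below is invented for
illustration and is not taken from the SPA reports.

```python
def baseline_cpi(alu, branch, store, load, m, a=1, b=3, r=0.9):
    """CPI without the lockup-free cache.

    alu, branch, store, load -- fractions of the instruction count
    m -- memory access penalty in cycles
    a -- cycles to execute an instruction
    b -- branch delay in cycles
    r -- cache hit ratio (0.0 to 1.0)
    """
    return (alu * a
            + branch * a * b
            + store * (m + a)
            + load * a * r
            + load * (m + a) * (1 - r))
```

For a hypothetical mix of 60% ALU, 15% branch, 5% store, and
20% load instructions, the function gives a CPI of about 4.80
for m=50 and about 1.65 for m=5, which shows how strongly the
memory access penalty dominates the result.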


B. TEST PROGRAMS 
1. General 
To evaluate the lockup-free cache interface, three
different types of programs are used. These are all
relatively short programs ranging from about 300,000 to
600,000 SPARC assembly code instructions. All of the programs
are run under SunOS 4. The programs used for this thesis
are described below.

2. Pseudo Code Interpreter
This program translates and executes a specific 
pseudo-code program. This particular pseudo-code is designed 
for a simple computer with 2000 words of 10-digit memory. The 
program reads an instruction from a memory location, decodes 
it, and then executes it. This process continues until the 
last instruction is executed. For testing the simulator the 
pseudo-code program calculates the square and square root of 
each of the numbers read from locations in memory. The 
pseudo-code program execution trace consists of 359,777 SPARC 
assembly language instructions. 
3. Launch Trajectory Calculator 
This program reads rocket launch data, such as launch
time and range, from a file, and calculates the altitude and
trajectory of all the launches. The trace consists of 342,440
instructions.
4. Matrix Multiplication 
This program performs a 20 X 20 matrix multiplication.
The results of the matrix multiplication are put into a third
matrix. The lack of sufficient disk space prevents the use of
a larger trace. The program consists of 524,852 instructions.


C. SIMULATION EXPERIMENTS
1. General
In conducting the experiments, data was collected on
each of the test programs, using MAQ sizes of 0, 1, 4, 8, 16,
and 32. These sizes were chosen to determine the trend of the
performance and to determine the optimum queue size. This
experiment was conducted for each of the MAQ schemes using the
fixed parameters. Figure 5.1 shows the performance results of
the FIFO MAQ scheme. Figure 5.2 shows the results of the
single-queue-miss-priority MAQ, and Figure 5.3 shows the
results of the separate-queue-miss-priority MAQ. For
comparison, the CPI values of the test programs without using
a lockup-free cache interface are:

(1) Pseudo Code - CPI=1.71 for m=5, 4.41 for m=50

(2) Matrix Mult. - CPI=1.35 for m=5, 2.58 for m=50

(3) Trajectory - CPI=1.71 for m=5, 4.33 for m=50.


2. MAQ Sizes 

This experiment examines the effects on CPI of MAQ
sizes across the three configurations. The MAQ sizes range
from 0 to 30. For the first set of experiments we use a memory
access penalty of 50 cycles. For each of the configurations
and each of the test programs, the CPI improved significantly
as the queue size increased from 0 to 12. The CPI value
remained virtually the same for queue sizes greater than 12.
The average decreases in CPI from the queue size of 0 to 12
were 40% for the Pseudo-Code program, 41.1% for the Trajectory
program, and 15.5% for the Matrix Multiplication program.

Figure 5.1 Effects of FIFO MAQ Scheme
(m=memory access delay)
The largest improvement in CPI occurred as the queue
size went from 0 to 1. In this case, the average improvements
across the different MAQ schemes were 27.0% for the Trajectory
program, 12.7% for the Pseudo-Code program, and 10.0% for the
Matrix Multiplication program. A queue size of zero effectively
simulates not using a queue: the processor stalls whenever a
main memory request occurs while an earlier request is still
being serviced by main memory.
For the second set of experiments we used a memory delay
of 5 cycles with each of the schemes. The results show that
there was an average improvement in CPI of less than 2.0% from
a queue size of 0 to 1. There was no further improvement in
any of the schemes with queue sizes greater than one. As with
the previous experiments, the separate-queue-miss-priority MAQ
configuration yielded the best performance, followed by the
single-queue-miss-priority MAQ.

Figure 5.2 Effects of Single-Queue-Miss-Priority MAQ Scheme (m = memory access delay)

3. MAQ Configurations 

This experiment examines the effects on the CPI of the
configuration of the MAQ. Overall, the
separate-queue-miss-priority MAQ configuration presented the
best performance, followed by the single-queue-miss-priority
MAQ. For the Pseudo-Code program with m=50, the CPI values
ranged from 1.75 to 2.22 using the FIFO MAQ, 1.67 to 2.13 using
the single-queue-miss-priority MAQ, and 1.41 to 1.85 with the
separate-queue-miss-priority MAQ. For the Trajectory program
with m=50, the CPI values ranged from 1.70 to 2.07 (FIFO), 1.61
to 2.07 (single), and 1.37 to 1.81 (separate). The Matrix
Multiplication program with m=50 had CPI values of 1.35 to
1.42 (FIFO), 1.30 to 1.39 (single), and 1.15 to [illegible]
(separate). Using the optimal queue size of 12 and m=50, the
separate MAQ scheme performed an average of 188% better than
not using a lockup-free cache; the single-queue-miss-priority
MAQ scheme performed an average of 148% better; and the FIFO
MAQ performed an average of 135% better.

Figure 5.3 Effects of Separate-Queue-Miss-Priority MAQ Scheme (m = memory access delay)

Using m=5, for the Matrix Multiplication program, CPI
values ranged from 1.10 to 1.11 with the FIFO scheme, from
1.10 to 1.11 with the single-queue-miss-priority scheme, and
remained unchanged at 1.10 across the different sizes with the
separate-queue-miss-priority scheme. For the Pseudo-Code
program, the CPI ranges were 1.32 to 1.33 (FIFO), 1.32 to 1.33
(single), and unchanged at 1.29 for the separate scheme. The
CPI results for the Trajectory program were 1.32 to 1.33
(FIFO), 1.32 to 1.33 (single), and unchanged at 1.28 for the
separate scheme. We notice that there was little or no change
in CPI using the different schemes for each of the programs.
This is because with a small memory access penalty, the queue
does not grow much and the turnaround time for dependent data
is minimal, thus greatly reducing the chance of a memory delay
stall.

Also with m=5, the separate MAQ scheme performed an
average of 30% better than not using a lockup-free cache; the
single-queue-miss-priority MAQ scheme performed an average of
20% better; and the FIFO MAQ performed an average of 19%
better.


D. SUMMARY 
In this chapter we have presented a high-level simulation
to study the performance of a lockup-free cache interface on
a RISC architecture. The simulations provide an indication of
how various cache interface designs may perform. Overall, we
found that the use of a lockup-free cache resulted in a
performance improvement of up to nearly 200%.

One observation is that the size of the MAQ is a
considerable factor for each of the designs. From a queue
size of 0 to about 12, the CPI values improved. As the queue
size increased beyond 12, there was little or no change in
the CPI.

Another observation is that the design of the MAQ also had
an effect on performance. Whereas each of the cache interface
designs showed an improvement in CPI, the
separate-queue-miss-priority MAQ configuration yielded the best
CPI values. This is probably attributable to the separate
queues allowing both read misses and writes to have assured
space. This is not the case with the single-queue-miss-priority
queue, where the MAQ may consist entirely of entries of the
same type. The separate MAQ configuration may also further
prevent a processor stall. For example, if the write MAQ is
full and a read miss occurs, the processor will enqueue the
read miss and continue processing if the read MAQ is not full.
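
The stall condition described above can be sketched as a small occupancy model. This is an illustrative sketch only, not the RICIS implementation: the names SingleMaq, SeparateMaq, and try_enqueue are hypothetical, and only queue occupancy is modeled (no timing, priorities, or dequeueing).

```cpp
#include <cstddef>

// Kinds of request that must be sent to main memory.
enum class Request { ReadMiss, Write };

// Hypothetical single queue: read misses and writes share one pool of slots.
struct SingleMaq {
    std::size_t capacity;
    std::size_t entries;
    explicit SingleMaq(std::size_t cap) : capacity(cap), entries(0) {}

    // Returns false when the queue is full, i.e. the processor must stall.
    bool try_enqueue(Request) {
        if (entries >= capacity) return false;
        ++entries;
        return true;
    }
};

// Hypothetical separate queues: each request type has its own assured space.
struct SeparateMaq {
    std::size_t read_capacity;
    std::size_t write_capacity;
    std::size_t reads;
    std::size_t writes;
    SeparateMaq(std::size_t rcap, std::size_t wcap)
        : read_capacity(rcap), write_capacity(wcap), reads(0), writes(0) {}

    // A full write queue does not block a read miss, and vice versa.
    bool try_enqueue(Request r) {
        std::size_t& used = (r == Request::ReadMiss) ? reads : writes;
        const std::size_t cap =
            (r == Request::ReadMiss) ? read_capacity : write_capacity;
        if (used >= cap) return false;
        ++used;
        return true;
    }
};
```

With a single queue of two slots already holding two writes, a read miss is refused and the processor stalls; with separate one-slot queues, the same read miss is still accepted after the write queue fills.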

Finally, we observed from the results of using the memory
delays of 50 cycles and 5 cycles that the lockup-free cache
interface is more effective as the memory access penalties
increase. Also, as memory access penalties decrease, the
optimal size of the MAQ also decreases. This emphasizes
that the need for a lockup-free cache interface will grow as
the discrepancy between CPU speed and main memory access
penalty grows. Memory access penalties in future systems are
expected to exceed 140 cycles [Jou90].
VI. CONCLUSIONS


A. OVERVIEW

In this thesis we presented a study of the RISC
architecture and its unique features. Our emphasis was on the
effects on RISC performance of a lockup-free cache interface.
The lockup-free cache interface features a queue to hold
memory requests, allowing program execution to continue
while the memory requests are being served. We studied
these effects by simulating the execution of actual programs.

To simulate and analyze the performance of a RISC and a 
lockup-free cache interface, we used several tools: SPA 1.0, 
SATTA, and RICIS. SPA is an available set of tools used to 
trace and analyze the execution of programs. We developed the 
SATTA to transform the address trace generated by SPA to 
detailed, more readable instruction records, and to produce 
modified address trace files for the RICIS. We developed 
RICIS to simulate and measure the effects of a lockup-free 
cache interface. 

We examined various alternative schemes of the cache
interface. The major design issues addressed were: (1) the
size of the memory request queue, (2) whether to use a FIFO
policy or one based on an assigned priority, and (3) whether
to hold both read misses and writes in a single queue
or to store them in separate queues.

Results of the experiments showed that the lockup-free
cache considerably improved the performance of the RISC. The
queue scheme and the queue size each had considerable effects
on the performance of the cache interface. The
separate-queue-miss-priority scheme yielded the best performance
results, followed by the single-queue-miss-priority, and then
the FIFO queue. For each scheme, the performance improved as
the queue size went from 0 to about 12, after which the
performance remained virtually unchanged. The greatest
improvement was noted as the queue size advanced from 0 to 1.
A queue size of zero has the same effect as not using
a queue in the lockup-free cache interface design.


B. FUTURE RESEARCH 

The area that requires most emphasis for continued work is
the simulation environment. The RICIS can be modified to
simulate the fetching and executing of instructions out of
program order. This would determine if further processor
stalling can be prevented by enqueuing instructions that are
dependent on those instructions in the memory request queue.
The dependent instructions are re-issued as the dependency
problem is resolved. Meanwhile, the fetching and execution of
new instructions continues. This feature is only partly
implemented in RICIS, as the problem of the dependency posed by
new instructions that are dependent on those instructions in
the instruction queue still needs addressing.

Another area for future research is the simulation of
floating point instructions. Later versions of SPA may
provide this feature. Minor modifications to SATTA and RICIS
would then be required. Finally, the user interface for both
the SATTA and the RICIS could be improved. An interface that
combines the three simulation tools would greatly improve the
user environment and conserve considerable computer resources,
thus allowing experiments using larger, more reliable
benchmarks.
APPENDIX A. USING THE SPA 1.0 SIMULATOR


The SPY Component: Address trace generator. 

In tracing the execution of a program, the SPY can pass
the address trace directly to a trace analyzer or to a file.
For instance, the command

% SPY myprog

traces the execution of the (executable) file myprog and
passes the results to a file of the format
progname.pid.invocation number. The command

% SPY -p 'spanner | spout' /usr/ls

generates an address trace of the ls command and passes it to
the SPANNER program. The SPANNER then performs its functions
and pipes the results to the SPOUT. The SPOUT displays the
results of the trace. The -p option directs SPY to pass the
results to the analyzer. Similarly, the command

% SPY -p spanner CC myprog.C -o myprog

traces the execution of the CC command and passes the results
to SPANNER.


The SPANNER Component: SPARC instruction analyzer. 
The SPANNER program reads the address trace file generated
by the SPY program and compiles instruction count information.
This information includes the number of times each type of
instruction was executed, and the number of cycles taken by
each type of instruction. SPANNER also calculates the
number of cycles consumed by simulated cache misses, and it
provides numerical data concerning conditional branches and
window handlers.

SPANNER offers the user various options on system
configurations, as shown in Figure A.1. The -c option lets the
user choose the type of cache memory system to use for the
simulation. The choices are the SS1, used in the SPARCstation
1, and the SS2, used in the SPARCstation 2. The -p option
specifies the type of processor to simulate. The choices are
the MB86901 processor and the Cypress CY7C601.


Spanner [-c cache] [-p processor] [-on]
        [-un] [-rn] [-wn] filename

Figure A.1 SPANNER Command Format


The other options allow the user to set the specific
number of cycles or the specific size for particular events.
The -o option lets the user specify the number of cycles
consumed by a register window overflow. The -u option sets
the number of cycles for a window underflow. The -r option lets
the user set the interval, in cycles, at which to view interim
output of the trace results. The -w option specifies the number
of register windows to be simulated.

Default values are set by the SPANNER for each of the
options. The default values closely resemble the features and
characteristics of the SPARCstation 2. The SS2 is the default
cache, the CY7C601 is the default processor, 170 cycles is the
default for the register window overflow, 110 cycles is the
default underflow cost, and the default for the number of
register windows is 8.


The SPOUT Component - Instruction Count Tables Generator 
The SPOUT component formats and displays tables of
instruction count data obtained from the SPANNER. A
description of the simulated architecture configuration is
displayed in the report heading as shown in Figure A.2.

Spanner - SPARC Performance Analyzer
cpu: cy7c601
cache: ss2
register windows: 8
overflow cost: 170 cycles
underflow cost: 110 cycles

Figure A.2 SPOUT Report Heading and Parameter Settings


Figure A.3 is a table from a SPOUT report which shows an
overview of the instruction and cycle count of the program
trace. This table shows that 65.9% of all the cycles consumed
by this program execution were spent executing
instructions, 2.1% were taken up by window handlers, and 25.3%
by cache cycles.


OVERALL           overall (%)        category (%)
                  cycles   inst.     cycles   count    cycles
instructions      65.9                                 187588
annulled slots
load-use stalls                                        13024
trap cycles                                            284
window handlers   2.1                                  6050
cache cycles      25.3

Figure A.3 SPA Overall Instruction Count Listing


Figure A.4 is another table from the same SPOUT report.
This table displays the trace data of memory access
instructions. Here, we see that 19.7% of the cycles
required to execute the program were consumed by load
instructions, and load instructions account for 86.6% of the
memory access cycles. The table also shows that
18.7% of all the instructions traced were
loads, as were 90.7% of all the memory access instructions.
The raw data column shows that there were 27720 load
instructions traced, consuming 55440 machine cycles.

MEMORY ACCESS     overall (%)        category (%)
                  cycles   inst.     cycles   count    cycles
load              19.7     18.7      86.6     90.7     55440
store                                13.4              8556
atomic
                                                       63996

Figure A.4 SPOUT Memory Access Instruction Count

SPOUT reports also contain similarly formatted tables of
data for each SPARC instruction, each window size, each type
of cache, and control transfers. The SPOUT report provides
the user with the data to determine which instructions, events,
and configurations have the greatest effects on the
architecture performance. Also, CPI values can easily be
obtained and compared by dividing the raw cycles value by the
corresponding raw instructions value.
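
The percentages and per-class cycle costs quoted from Figure A.4 can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper names percent_share and cycles_per_instruction are hypothetical and are not part of SPA.

```cpp
#include <cmath>

// Percentage share of `part` in `whole`, rounded to one decimal place,
// matching the precision of the SPOUT tables.
double percent_share(double part, double whole) {
    return std::round(1000.0 * part / whole) / 10.0;
}

// Average cycles per instruction for one instruction class:
// raw cycles divided by raw instruction count.
double cycles_per_instruction(double cycles, double count) {
    return cycles / count;
}
```

Using the raw values quoted above, percent_share(55440, 63996) reproduces the 86.6% share of memory access cycles attributed to loads, and cycles_per_instruction(55440, 27720) gives 2.0 cycles per load.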
APPENDIX B. USING THE RICIS PROGRAM


Obtaining Address Traces. 

In order to use the RICIS, an address trace of an
executable program must first be obtained. This address trace
can be produced using the SPA 1.0 package, as explained in
Appendix A. After obtaining a trace file from SPA, a modified
version of the trace must be produced for use explicitly by the
RICIS. This trace is produced by the SPARC Address Trace
Transformer/Analyzer (SATTA) tool.

To produce the modified trace using SATTA, simply type the
command SATTA at the command line prompt (%). The program
will then produce the trace file, naming it RICIS.FIL. The
user may rename the file to a more suitable file name after
the file is generated.
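
SATTA builds its records by decoding each 32-bit SPARC instruction word with masks and shifts (see the listing in Appendix C). A condensed sketch of that decoding step follows. It uses the same masks as the listing for the rd, op3, rs1, and index-bit fields, but the struct and function names (Format3Fields, decode_format3, is_nop) are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Fields of a SPARC format 3 instruction word (op = 10 or 11).
struct Format3Fields {
    std::uint32_t op;      // bits 31-30: instruction format
    std::uint32_t rd;      // bits 29-25: destination register
    std::uint32_t op3;     // bits 24-19: opcode within the format
    std::uint32_t rs1;     // bits 18-14: first source register
    std::uint32_t i;       // bit 13: 0 = rs2 is used, 1 = simm13 is used
    std::uint32_t rs2;     // bits 4-0: second source register
    std::uint32_t simm13;  // bits 12-0: immediate (not sign-extended here)
};

Format3Fields decode_format3(std::uint32_t word) {
    Format3Fields f;
    f.op     = word >> 30;
    f.rd     = (word & 0x3e000000u) >> 25;
    f.op3    = (word & 0x01f80000u) >> 19;
    f.rs1    = (word & 0x0007c000u) >> 14;
    f.i      = (word & 0x00002000u) >> 13;
    f.rs2    = word & 0x0000001fu;
    f.simm13 = word & 0x00001fffu;
    return f;
}

// A nop is encoded as "sethi 0, %g0": op = 00, op2 = 4 (sethi),
// rd = 0, and a zero 22-bit immediate.
bool is_nop(std::uint32_t word) {
    const std::uint32_t op    = word >> 30;
    const std::uint32_t rd    = (word & 0x3e000000u) >> 25;
    const std::uint32_t op2   = (word & 0x01c00000u) >> 22;
    const std::uint32_t imm22 = word & 0x003fffffu;
    return op == 0 && op2 == 4 && rd == 0 && imm22 == 0;
}
```

For example, the word 0x86004002 decodes to op = 2, op3 = 0 (add), rs1 = 1, rs2 = 2, rd = 3, that is, add %g1,%g2,%g3.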


The User Interface 

To begin a RICIS session, type in the command RICIS at the
command line prompt (%). The program then asks the user a
series of questions to define the parameters and scheme of the
lockup-free cache interface to simulate. Figure B.1 is an
example of a start-up session for RICIS. The first input to
the system is the address trace file name. Again, this is the
file produced by SATTA. The next input is the cache hit ratio
to be simulated. The value entered must be represented as a
percentage from 0.0 to 100.0. For example, to enter a hit
ratio of 90%, the user must enter 90.0.


% RICIS

Enter name of file to parse
rocket.luf

Enter simulated CACHE HIT RATE: 90.0
Simulate FIFO or Priority Queue? (f/p) : f
Enter Memory Queue Size: 5
Do you want to use Dependent Instruction Queue? (y/n)
Enter number of stall cycles for Load dependency: 1
Do you want to view Queues after every activity? (y/n): n
Enter interval value for viewing Queues: 600000
Do you want to continue with simulation? (y/n): n
Do you want to do another simulation? (y/n): y
Keep same parameters? (y/n): y

Figure B.1 RICIS Startup Session


The next input deals with the type of queue to simulate.
The user must enter the character f for FIFO simulation, and
p for priority queue simulation. If FIFO is chosen, the next
input is the size of the queue to simulate. This value must
be an integer between 0 and 50. However, if the priority
queue is chosen, the user must choose whether to simulate a
single queue or separate queues for reads and writes. If the
separate queue is chosen, then the user must enter the size of
each queue. If the single-priority queue is chosen, the user
enters the size of the queue.

Also, if a priority queue is chosen, the user must enter
a priority number for a read and for a write. For example, if
a read is to have priority over a write, then a value of 0 is
entered for Read Priority and a 1 for Write Priority. The next
question deals with the Dependent Instruction Queue. The user
must enter the value n for this response, as this feature is
not fully implemented at this time.
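
The effect of the two priority numbers can be illustrated with a short selection routine. This is a sketch under stated assumptions, not the RICIS code: it assumes that a lower number means higher priority (as in the example of 0 for reads and 1 for writes) and that requests of equal priority are served oldest first. The names Kind, Entry, and next_to_service are hypothetical.

```cpp
#include <cstddef>
#include <vector>

enum class Kind { Read, Write };

// One pending memory request; index 0 of the queue is the oldest entry.
struct Entry {
    Kind kind;
    unsigned long address;
};

// Returns the index of the request to service next, or -1 if the queue
// is empty. Lower priority numbers win; ties go to the oldest entry.
int next_to_service(const std::vector<Entry>& queue,
                    int read_priority, int write_priority) {
    int best = -1;
    int best_priority = 0;
    for (std::size_t k = 0; k < queue.size(); ++k) {
        const int p = (queue[k].kind == Kind::Read) ? read_priority
                                                    : write_priority;
        if (best < 0 || p < best_priority) {
            best = static_cast<int>(k);
            best_priority = p;
        }
    }
    return best;
}
```

With a queue holding a write followed by two reads, priorities of 0 (read) and 1 (write) select the first read, while equal priorities reduce to FIFO order and select the write.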

The user may elect to see a cycle-by-cycle trace of the
simulation. If the user wants to see only the CPI results,
then he/she must enter an n. The last start-up input to the
simulation is the interval, in number of instructions, at which
to view interim results. If the user wants to see only the
final results, then a very large number must be entered.
Since trace files contain hundreds of thousands of instruction
records, a value of 1,000,000 is usually sufficient. The user
can obtain the number of instructions by running the original
trace through the SATTA program.

At each interim result pause, the user is given the option
of terminating the session or continuing. This feature is
particularly useful for long, slow sessions. After each
session terminates, the user is given a choice of conducting
another session. If the user chooses to do another
simulation, he/she may choose to keep the same parameters as
the previous session, or to enter new ones.
RICIS Output.
The following is an example of RICIS results output:

Number of Instructions Executed: 642321
Number of Cycles Elapsed: 834232
CPI Value: 1.30

The first line of the output shows the total number of
instructions issued from the trace file. The second line
shows the total number of (simulated) cycles consumed by the
program execution. The last line shows the CPI value,
obtained by dividing the total cycles by the total
instructions. For interim results, the values shown would be
the results up to the instruction count.
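
The CPI line in the sample output can be reproduced directly from the two counts above it. The following is a minimal sketch; the function names cpi and format_cpi are hypothetical and do not come from RICIS.

```cpp
#include <cstdio>
#include <string>

// CPI = total simulated cycles / total instructions issued.
double cpi(unsigned long cycles, unsigned long instructions) {
    return static_cast<double>(cycles) / static_cast<double>(instructions);
}

// Formats the CPI with two decimal places, as in the sample output.
std::string format_cpi(unsigned long cycles, unsigned long instructions) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%.2f", cpi(cycles, instructions));
    return std::string(buf);
}
```

For the sample session, format_cpi(834232, 642321) yields "1.30", since 834232 / 642321 is approximately 1.2988.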
APPENDIX C. SPARC ADDRESS TRACE TRANSFORMER/ANALYZER


//**************************************************************************
//
// Title: SPARC ADDRESS TRACE TRANSLATOR/ANALYZER (SATTA)
// Author: Leonard Tharpe, Captain, U.S. Army
// Date: September 1992
// Revised:
// Description: This program simulates a cycle-by-cycle execution of a program using
// the Sun 4 SPARC architecture. The program takes as input a binary address trace of
// a compiled executable program and provides the user with a detailed instruction record for
// every cycle of program execution. This information includes the cycle/instruction number,
// the status of the instruction, a 32-bit representation of each instruction, the opcode,
// the location in memory of the instruction fetch, as well as the data fetch. This program
// also provides data and information on register use. This information consists of the
// number of times each register is used as a source and as a destination, and it provides
// data on register dependency.
//
//**************************************************************************


#include <stdlib.h>
#include <string.h>
#include <iostream.h>
#include <iomanip.h>
#include <fstream.h>


// ** This is the format of each instruction record produced by
//    the address trace.
struct Instruction {
    char et;
    char dav;
    short tn;
    unsigned long op;
    unsigned long ia;
    unsigned long da;
};

// ** This is the record format for register-use data.
struct reg_data {
    int source;
    int dest;
    int last_use;
    int lex_dist;
    int tot_dist;
    float avg_dist;
};

// ** This structure is used to hold and calculate register dependency data
struct dependency {
    int last_write;
    int ref_dist;
    int load_count;
    int ref_count;
    int tot_dist;
    int ld_used;
    float avg_dep_dist;
};


//********************* Begin main program *********************
main()
{
    int ctr = 0;
    int loads = 0;
    int stores = 0;
    int avg_count = 0;
    float avg_dist = 0.0;
    float tot_dep = 0.0;
    unsigned long op_field;
    unsigned long op_value;
    unsigned long bicc_cond;
    unsigned long ticc_cond;
    unsigned long displ;
    unsigned long rd_value;
    unsigned long rs1_value;
    unsigned long rs2_value;
    unsigned long inst_hold;
    unsigned long index_bit;
    unsigned long annul_bit;
    unsigned long asi;
    unsigned long simm13;
    unsigned long imm22;
    unsigned long displ22;
    int ref_type;
    Instruction inst;

    reg_data reg_count[32];    // keeps data of register use
    dependency reg_dep[32];    // keeps data of register dependency

    // ------------ function definitions ------------
    void regcalc(reg_data& reg, int& count);
    void depcalc(dependency& reg, int& location, int& ref);
    void check_reg_dep(dependency& reg, int& count, int& ref);
    void clear_screen(int& i);
    void start_up(char *ifile, char *ofile);
    void instr_count(char *afile, Instruction& instr, int& icount);


    // ************** array of registers (notation) **************
    static char reg_sym[32][5] =
        {"%g0", "%g1", "%g2", "%g3", "%g4", "%g5", "%g6", "%g7",
         "%o0", "%o1", "%o2", "%o3", "%o4", "%o5", "%o6", "%o7",
         "%l0", "%l1", "%l2", "%l3", "%l4", "%l5", "%l6", "%l7",
         "%i0", "%i1", "%i2", "%i3", "%i4", "%i5", "%i6", "%i7"};

    // ************** array of format 3 op = 11 opcodes **************
    static char op11_inst[64][10] =
        {"ld", "ldub", "lduh", "ldd", "st", "stb", "sth", "std",
         "unimp", "ldsb", "ldsh", "unimp", "unimp", "ldstub", "unimp", "swap",
         "lda", "lduba", "lduha", "ldda", "sta", "stba", "stha", "stda",
         "unimp", "ldsba", "ldsha", "unimp", "unimp", "ldstuba", "unimp", "swapa",
         "ldf", "ldfsr", "unimp", "lddf", "stf", "stfsr", "stdfq", "stdf",
         "unimp", "unimp", "unimp", "unimp", "unimp", "unimp", "unimp", "unimp",
         "ldc", "ldcsr", "unimp", "lddc", "stc", "stcsr", "stdcq", "stdc",
         "unimp", "unimp", "unimp", "unimp", "unimp", "unimp", "unimp", "unimp"};

    // ************** array of format 3 op = 10 opcodes **************
    static char op10_inst[64][10] =
        {"add", "and", "or", "xor", "sub", "andn", "orn", "xnor",
         "addx", "unimp", "unimp", "unimp", "subx", "unimp", "unimp", "unimp",
         "addcc", "andcc", "orcc", "xorcc", "subcc", "andncc", "orncc", "xnorcc",
         "addxcc", "unimp", "unimp", "unimp", "subxcc", "unimp", "unimp", "unimp",
         "taddcc", "tsubcc", "taddcctv", "tsubcctv", "mulscc", "sll", "srl", "sra",
         "rdy", "rdpsr", "rdwim", "rdtbr", "unimp", "unimp", "unimp", "unimp",
         "wry", "wrpsr", "wrwim", "wrtbr", "fpop1", "fpop2", "cpop1", "cpop2",
         "jmpl", "rett", "ticc", "iflush", "save", "restore", "unimp", "unimp"};

    // ************** array of trap (ticc) opcodes **************
    static char op10_ticc[16][10] =
        {"tn", "te", "tle", "tl", "tleu", "tcs", "tneg", "tvs",
         "ta", "tne", "tg", "tge", "tgu", "tcc", "tpos", "tvc"};

    // ************** array of format 2 op = 00 opcodes **************
    static char op00_inst[8][15] =
        {"unimp", "unimp", "bicc", "unimp", "sethi", "unimp", "fbfcc", "cbccc"};

    // ************** array of branch condition opcodes **************
    static char op001_inst[16][15] =
        {"bn", "be", "ble", "bl", "bleu", "bcs", "bneg", "bvs",
         "ba", "bne", "bg", "bge", "bgu", "bcc", "bpos", "bvc"};


    // ------------ initialize register data arrays -----------
    for (int k = 0; k < 32; ++k)
    {
        reg_count[k].source = 0;
        reg_count[k].dest = 0;
        reg_count[k].last_use = 0;
        reg_count[k].lex_dist = 0;
        reg_count[k].tot_dist = 0;
        reg_count[k].avg_dist = 0.0;
    };
    // --------- initialize register dependency arrays ---------
    for (int n = 0; n < 32; ++n)
    {
        reg_dep[n].last_write = -1;    // last line register written to
        reg_dep[n].ref_dist = -1;      // distance between reference and last write
        reg_dep[n].tot_dist = 0;
        reg_dep[n].load_count = 0;
        reg_dep[n].ref_count = 0;
        reg_dep[n].ld_used = 0;        // was register's last use a load?
        reg_dep[n].avg_dep_dist = -1.0;
    };


    ofstream recfile("records.dat");
    ofstream cachefile("RICIS.FIL");
    fstream infile;
    float load_percent;
    char cont;
    char view_reg;
    char view_dep;
    char nop = 'n';
    int i, interval;
    int ic;
    int l_i = 0;    // load instruction flag
    int s_i = 0;    // store instruction flag
    char file1[20], file2[20];

    clear_screen(i);
    start_up(file1, file2);
    instr_count(file1, inst, ic);

    cout << "\nEnter cycle intervals to view output: ";
    cin >> interval;

    ofstream asmfile(file2);
    // --- cout << "file pointer declared " << '\n' << flush;
    infile.open(file1, ios::in | ios::nocreate);
    // --- cout << "file is now open " << '\n' << flush;
    infile.seekg(ctr * sizeof(inst), ios::beg);
    // --- cout << "seek invoked " << '\n' << flush;


    // ** Read executable address trace until end of file reached.
    while (infile.read((char *) (&inst), sizeof(inst))) {

        recfile << "record : " << dec << ctr << '\n' << flush;
        recfile << "exec status : " << int(inst.et) << '\n' << flush;
        recfile << "valid data addr : " << int(inst.dav) << '\n' << flush;
        recfile << "trap no. : " << short(inst.tn) << '\n' << flush;

        recfile << "instruction : ";
        recfile << int(inst.op) << '\n' << flush;

        recfile << "binary representation : ";
        // -------- print 32-bit binary representation of opcode ------
        int z, mask;
        mask = 1;
        mask <<= 31;
        inst_hold = inst.op;
        op_field = inst.op;
        annul_bit = inst.op;
        for (i = 1; i <= 32; ++i)
        {
            recfile << (((int(inst.op) & mask) == 0) ? "0" : "1");
            inst.op <<= 1;
        }
        recfile << '\n';
        // ------------------------------------------------------
        // ------------- extract "op" field --------------------
        op_field = (op_field >> 30);
        recfile << "\nop field : " << int(op_field) << '\n' << flush;
        // ------------------------------------------------------
        // ------------ examine format 1 - op = 01 ---------------
if (op field == 1) 
giopile= Inst fold G Oxj,f frre ert; 
LreGuatieua<. Opeecde sucali ' << *\n" << flush; 
Poem! toma.  ‘droolacement =: "' << int {displ) << ’\n’ << flush; 
asmfile << setw(5) << ctr << ": ae 


asmfile << setw(-10) << "call "; 
asmfile << int (displ) << ’\n’ << flush; 


eaenerite << " —2  —" << setfill(’0’)-; 
cachefile << setw(8) << hex << inst.ia; 
eagenefile << " eee SE 
he 
        // ------------ examine format 2 - op = 00 ----------------
        if (op_field == 0)
        {
            // ---------- extract opcode field ----------
            nop = 'n';
            op_value = (inst_hold & 0x1c00000) >> 22;
            recfile << "opcode value : " << int(op_value) << '\n' << flush;
            // --- check for bicc instructions -------
            if (op_value == 2)
            {
                bicc_cond = (inst_hold & 0x1e000000) >> 25;
                recfile << "opcode : " << op001_inst[bicc_cond] << '\n' << flush;
            };

            // ---- check for 'nop' instruction -----
            // strncmp returns a nonzero value on a mismatch, so test against 0
            if (strncmp("sethi", op00_inst[op_value], 5) != 0)
                recfile << "opcode : " << op00_inst[op_value] << '\n' << flush;
            else
                nop = 'y';    // this instruction is a potential nop


{fo ws~en=— Check annul pit 9 ————___ ee 

annul bit y= (annul Die >> 27)/; 

if (annul _ bit == 0) 

( 

{/ = [saa asa ae extract rd register field —==———— ae 
rd value = (inst _ hold & 0x3e000000) >> 25; 
recfile << “rd vaiue ; “ << int(rd value) —< '\n’ << flush 
if (nop ==7' n-) 
( 


reg_count[rd value].dest += 1; 
regcalc(reg count[rd value], ctr); 


[fo wren ner rr extract immediate address ------------- 
imm22 = inst_hold & O0x3fffff; 
if (((nop == ’y’) && (imm22 == 0)) && (ralvalue =—s777 
{ 
recfile << “opcode: nop “ <<F)" \n7 acres 
} 
else 
( 
if (op_value != 2) 
recfile << “opcode : "“ << op00_instflop_ value] << “\n""<<@fiae 
recfile << "immediate address : " << hex << imm22 << ‘\n’ << flush; 


reg -Coune, rd Vaivejaesc sa, 
regcalc(reg count (rd_ value], ctr); 
cachefile <<-" 2. "<< sertaii 02 
<< setw(8) << hex << imm22 << " a 


<< setw(2) << hex << rd_value << ’\n’ 


asmfile << setw(5) << ctr << ": He 

if (((strnemp ("sethi”, op00_inst fop_ ae 5) == 0) 
&& (imm22== 0)) && (rd_value == 0)) 

{ 


} 


else 


( 


asmfile << setw(-10) << "nop " << ‘\n’ << flush; 


if (op value == 2) 


asmfile << setw(-10) << op001_inst[bicc_cond] << " ue 
else 

asmfile << setw(-10) << op00_instf[op value] << " Ws 
asmiile! << inm22 << 7; 


asmfile << reg_sym[rd_value] << ’\n’ << flush; 
else // annul bit is set 


bicc_cond = (inst_hold & 0x1e000000) >> 25; 
recfile << “annulled ° (73am 6 << (flush 




Pa em a eeeonOtt Lower ~ OpOOT inse/Dicc cond] << “\n’ << flush; 


f fo 7-7 extract displacement value ------------- 

GiSspiz2 -tinsesoid & OxItfitryr ; 

recfile << "displacement value : " << dec << displ22 << ’\n’ << flush; 
[[ ---------------------- == ------------- === === === ------ === 


acne lew <9) SCLW lo) —<acer <<" ee 

asmfile << setw(-10) << oes inst [bicc Cena), << ” ah 
aSmEiLle << dispiZ2 <<" 7 : 

asmfile << "(annulled)" << oan en ee LUSi, 


Gacnetile << " 3 ”" «<< setfill/(’ 0’) 
<< setw(8) << hex << inst.ia 
cm fr ff cc aN 
} 
VG 
        // ------------ examine format 3 - op = 10 ----------------
        if (op_field == 2)
        {
            // ------------ extract opcode field -----------------
            op_value = (inst_hold & 0x1f80000) >> 19;
            recfile << "opcode value : " << int(op_value) << '\n' << flush;
            recfile << "opcode : " << op10_inst[op_value] << '\n' << flush;
            // ----- check for ticc instruction -----
            if (op_value == 58)
            {
                ticc_cond = (inst_hold & 0x1e000000) << 3;
                ticc_cond = ticc_cond >> 28;
                recfile << "opcode : " << op10_ticc[ticc_cond] << '\n' << flush;
            };
            // -------- extract rd register field --------
            rd_value = (inst_hold & 0x3e000000) >> 25;
            recfile << "rd value : " << int(rd_value) << '\n' << flush;
            reg_count[rd_value].dest += 1;
            regcalc(reg_count[rd_value], ctr);
            // -------- extract rs1 register field --------
            rs1_value = (inst_hold & 0x7c000) >> 14;
            recfile << "rs1 value : " << int(rs1_value) << '\n' << flush;
            ref_type = 0;    // reference type is a read
            depcalc(reg_dep[rs1_value], ctr, ref_type);
            reg_count[rs1_value].source += 1;
            regcalc(reg_count[rs1_value], ctr);
            // -------- extract index bit --------
            index_bit = (inst_hold & 0x2000) >> 13;
            recfile << "index bit : " << int(index_bit) << '\n' << flush;
            if (index_bit == 0)
            {
                // -------- extract rs2 register field --------
                rs2_value = inst_hold & 0x1f;
                recfile << "rs2 value : " << int(rs2_value) << '\n' << flush;
                ref_type = 0;    // reference type is a read
                depcalc(reg_dep[rs2_value], ctr, ref_type);
                reg_count[rs2_value].source += 1;
                regcalc(reg_count[rs2_value], ctr);

asmfile << setw(5) <<"“ctre-<. "<< SCL ELLIE” 27m, 

if (op value == 58) 

asmfile << setw(-i0) << opl0 Cicc/ rec tecnd) os 
else 

asmfile << setw(-10) << op10_instf[op value] << " ne 
asmfile << reg sym[{rsl_value] << ","; 

asmfile << reg_ sym[rs2_ value)! << jee 

asmfile << reg sym[rd_ Value] << "in" << sflusEe 


cachefile << " DV << seer. 40 
<< setw(8) << hex << inst.ia 
<<¢ " " << hex << SebW (2) 3<<—  rsievel ue 
<< " " << hex << sétw(2)) << rs2evalue 
<< " " << hex << setw(2) << rd value << ’\n’ 


} 
else // index bit is set 
{ 
Simmi3  ® inst “Nola 450412. Ge 
recfile << "“simmi3 : “<< ant(simmiome-< “\n 0c oe. 
asmfile << setw(5) << ctr << ": ms 
if (op value == 58) 
asmfile << setw(-10) << op10_ticc[ticc_cond] << " " 
else 
asmfile << setw(—10) << opi0 anst/op valuc ae a 
asmfile << reg sym[rsl value] << ","; 
asmli le <<n ine simi cn 
asmfile << reg sym[rd value) << ‘\n“ << flush, 


Cachefile << "2  ' << setrumi oe 
<< setw(8) << hex << simm13 
<<." " << setw(2) << rsi value << " i 
<< " " << sSetw(2) << hex << sdevalue cae 
} 
}? 
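The field extraction used throughout this listing can be summarized in one place. The following is a minimal sketch, assuming the standard SPARC format-3 layout and the masks visible in the listing (0x3e000000 for rd, 0x7c000 for rs1, 0x2000 for the index bit, 0x1f for rs2). The struct and function names, the op-field extraction, and the 13-bit immediate mask are illustrative assumptions, not taken from the thesis code.

```cpp
#include <cassert>
#include <cstdint>

// Fields of a SPARC format-3 instruction word (hypothetical container type).
struct Format3Fields {
    unsigned op;      // bits 31..30 (assumed position of the op field)
    unsigned rd;      // bits 29..25
    unsigned op3;     // bits 24..19
    unsigned rs1;     // bits 18..14
    unsigned i_bit;   // bit 13: 0 selects rs2, 1 selects simm13
    unsigned rs2;     // bits 4..0
    unsigned simm13;  // bits 12..0, raw (not sign-extended)
};

// Splits one 32-bit instruction word into its format-3 fields.
inline Format3Fields decode_format3(std::uint32_t word) {
    Format3Fields f;
    f.op     = (word >> 30) & 0x3u;
    f.rd     = (word & 0x3e000000u) >> 25;
    f.op3    = (word & 0x01f80000u) >> 19;
    f.rs1    = (word & 0x0007c000u) >> 14;
    f.i_bit  = (word & 0x00002000u) >> 13;
    f.rs2    = word & 0x1fu;
    f.simm13 = word & 0x1fffu;
    return f;
}
```

For example, the word 0x86004002 (add %g1, %g2, %g3) decodes to op = 2, rd = 3, rs1 = 1, index bit 0 and rs2 = 2.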
ee Exam Pie i Sia ts en) oii ee 


if (op_field == 3)
{
    // ------ extract opcode field ------
    op_value = (inst_hold & 0x1f80000) >> 19;
    recfile << "opcode value : " << int(op_value) << '\n' << flush;
    recfile << "opcode : " << op11_inst[op_value] << '\n' << flush;
    // ------ extract rd register field ------
    rd_value = (inst_hold & 0x3e000000) >> 25;
    // ------ isolate load instructions ------
    l_i = 0;
    if (strncmp("ld",op11_inst[op_value],2) == 0)
    {
        l_i = 1;
        loads += 1;
        ref_type = 1; // reference type is a load
        reg_dep[rd_value].ld_used = 1;
        depcalc(reg_dep[rd_value], ctr, ref_type);
    };
    // ------ isolate store instructions ------
    s_i = 0;
    if (strncmp("st",op11_inst[op_value],2) == 0)
    {
        s_i = 1;
        stores += 1;
    };
    recfile << "rd value : " << int(rd_value) << '\n' << flush;
    reg_count[rd_value].dest += 1;
    regcalc(reg_count[rd_value], ctr);


    // ------ extract rs1 register field ------
    rs1_value = (inst_hold & 0x7c000) >> 14;
    recfile << "rs1 value : " << int(rs1_value) << '\n' << flush;
    ref_type = 0;
    depcalc(reg_dep[rs1_value], ctr, ref_type);
    reg_count[rs1_value].source += 1;
    regcalc(reg_count[rs1_value], ctr);


    // ------ extract index bit ------
    index_bit = (inst_hold & 0x2000) >> 13;
    recfile << "index bit : " << int(index_bit) << '\n' << flush;
    if (index_bit == 0)
    {
        // ------ extract rs2 register field ------
        rs2_value = inst_hold & 0x1f;
        recfile << "rs2 value : " << int(rs2_value) << '\n' << flush;
        ref_type = 0;

        depcalc(reg_dep[rs2_value], ctr, ref_type); // rs2 is the register read here
        reg_count[rs2_value].source += 1;
        regcalc(reg_count[rs2_value], ctr);


| [wan n- nn === == - 5 == === == 
lt een extract asi field ------------<------ 

aot i nsemioled ee Oxl fed) >> 5, 

neg LO Eo—maste Value = “ <<qvinmt (asi) << *°\n’ << flush; 
[fo wane nn nnn nn n= 
BsoMmrateuc< SeCtwl(5) << ctr << ': os 

asmiile << Setw(ai@jec< opli inst [op_ value] << " ae 


asmfile << reg sym[rs1_ Varuey << "+e 
<a reg sym[rs2_ value] ia 
asmfile << reg sym[{rd_value] << ’\n’ << flush; 


if ((1_i == 0) && (s I == 0)) 
{ 


Cachnerriema<) 92 " << Setfill(’0’) 
<< setw(8) << hex << inst.ia 
<< " " <€<¢€ setw(2) << hex << rsl value 
on Scouser wyo << hex << rs2 value 
ee seem) << hex << rd value << ’\n’; 


Me 
if (1_i == 1) 
{ 


cachefile << " 2 " << set fill(’0’) 

<< setw(8) << hex << inst.ia 

“een SetW (2) << hex << rsi value 

<< " " ¢<¢ setw(2) << hex << rs2 value 

ee eee Cet w (2) << hex << rd value << 7\n‘; 
Sacheritewca Oe << setfill(°0’) — 

<< setw(8) << hex << inst.da 

eee ne 


<< setw(2Z) << hex << rd_value << es 
M/ 
Lijit ==" 1) 
{ 
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cachefile << " 2 ©"S44 setts) 


<< setw(8) << Nez << 4nstl 22 


<o " " ¢<¢ gsetw(2) << hex << rsi value 

ge " << setw(2) << hex << rs2 value 

<o " " e¢ setw(2) << hex << rd value << “\n’-; 
cachefile << " 1 “ <2"seppii7 70 — 


a 


<< setw(8) << hex << inst.da 
cee re 
<< setw(2) << hex << rd value << ’\n’; 


else // index bit set 


( 


sSimm13 = anst hold 4 0x7 feLe. 


recfile << 
asmfile << 
asmfile << 
asmfile << 
asmfile << 


"simml13 :; " << Gnt(simml3)e<ce (nee elu 
setw (5) s<<tcer <<) ut (<< Set fiat |” =a) 
setw(-10) << opll_instfop value] << ” Le 


reg sym[rsl value] << "+" << int (simm13); 
"," << reg symird value] <<= \nas<<ertush, 


if ((1_i == 0) && (s_i == 0)) 
{ 
cachefile <<" 2..." << seefill (70 > 
<< setw(8) << hex << inst.ia 
<< " "<< setw(Z2) << hex << psi value 
ae re re 
<< sétw(2) << hex <<(rd valuce<<5 na 
Ve 
if (1 _i == 1) 
f 7 
cachefile << "" 2 “<< score manee . 
<< setw(8) << hex << inst.ia 
<< " " << setw(2) << hex << rsl_value 
cc ee ee 
<< setw(2) << hex << rd_ value << ’\n’; 
cachefile << "  @ see. scerfiily 0.7 
<< setw(8) << hex << inst.da 
cc r¢ re 
<< setw(2) << hex << rd_ value << ‘\n’; 
Me 
ty (See el 
{ 
cachefile << " 2 "" <<"“serfil 10 
<< setw(8) << hex << inst.ia 
<< " " << setw(2) << hex << rsil_value 
cc A ‘fe 
<< setw(2) << hex << rd_value << ’\n’; 
cachefile <<" “[° S << serena.) 


}; 
} 
io 


recfile << 


recfile << 


<< setw(8) << hex << inst.da 
cc ee ef 
<< setw(2) << hex << rd value << ‘’\n’; 


"inst addr : " << hex << Instlwia << “(Mm 26s 2 ush. 
“data addr : " << hex << inst -dae<< “\nm 7<< fsa. 


recfile CO NRK KK KKKKAKKAKKAKAKKEKAKRAKARKRKRAN CS oN? cc flush; 


LE (Cem) 
( 


%$ interval == 0) 


cout << "\nDo you want to see register usage data? (y/n) "; 
cin >> view reg; 




d; 


Peeve rege Y) aan (( (ceri) s interval) == 0)) 


( 


COouE << 
EG 
Gre 
a 
<a 


setiosflags(ios::left) << setw(10) << "Register" 


setw(1Z2) << 
setw(12) << 
setw(lZ2) << 
setw(12) << 


"Register" 
Source’ 
"Destination" 


"Average" oe ery 


Seen aay SCE1OSLIags(?0s;;i1eft)) << setw(l0) << "Number" 
mse tw(l2je << Symbol ” 
<< setw(1Z2) << "References" 
<< setw(12) << "References" 
<< setw(lZ) << "Lex. Distance" << '\n’; 
fom teat. 4 07 9 < S265 1F 7) 


{ 
Tee. Wee EO 

GOueC << RI << Ihe) << 
else 

EoOut 

COUE << 

<< 

<< 

<< 

<< 

<< 

ie 


eS wr, 
s 


Se << ge (2) << lS 
resetiosflags(ios::left) 
setw(12) << dec << reg sym{[i] 
setw(12) << dec << reg count [i].source 
setw(12) << reg count [{i].dest 
setiosflags (ios: : fixed) 
setw(12) << setprecision(1) 
Beg scOune/ij-avg dist << “\n’; 

}; 

cout << "\nNumber of instructions processed: 
}; 
if ((ctrtl1l) #* interval == 0) 


Pecaseh, fel << * \n > 


cout << ’\n’ << ctr?+l << " instruction records processed so far !!!"; 
cout << "\nDo you want to see register dependency data? (y/n) "; 
cin >> view dep; 


tes 


if ((view dep == ’y’) && ((ctrtl) % interval == 0)) 
{ 
cout << setiosflags(ios::left) << setw(10) << "Register" 
<< setw(12) << "Register" 
<< setw(lZ) "<<." Load" 
<< setw(lZ) <<” Source" 
<< setw(12) << "Average" 
<< setw(12) << "Percent" << ’\n’; 
cout << setiosflags(ios::left) << setw(10) << "Number" 
<< setw(12) << " Symbol" 
<< setw(1Z) << " Count 
<< Setwtiz) <<" Count” 
<< setw(12) << "Distance" 
<amsectwiilzje<< “Load Usetu<< “\n’> 
tot dep = 0.0; 
avg count = 0; 
BOR tet = 0 Pt Oe Tt) 
( 
Deh ot ci 10) 
Couto Rl! ect (i) ya< 4)"; 
else 
Coueraa. (Ry << ney) << 8) 
Gout << reseticsilags (ics: :left); 
CGOute-<« setw(iZ) << dec << reg sym[ij; 
cout << setw(12) << dec << reg dep[{i].load_count; 
if (reg _depf{i].ref count < 0) — 


cout cc re 
else 
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cout << setw(l2) << neg Gdéepia) aren icoune. 
cout << setiosflagsiiicos’, 27 ea), 
cout << setw(l2) << setereeicsi10on 
if (reg dep[i].avg dep dist < 0.0) 
Cout 6. (Fe ee 
else 
cout << reg dep[i].avg_dep dist, 
if (reg dep[i].load county< 77) 
cout << " RO eee 
else 
load percent = float (reg dep[i].load count) /float (loads); 
load percent = load_percent * 100.0; — 
cout << setw(1l2) << setprecision(1); 
cout << load percent <3. 7 


        // ** calculate average dependency distance **
        if (reg_dep[i].avg_dep_dist > 0.0)
        {
            tot_dep += reg_dep[i].avg_dep_dist;
            avg_count++;
        };
    };
    avg_dist = tot_dep/avg_count;
    cout << "Average load dependency distance is: " << avg_dist << '\n';
cout << "Total number of load instructions: “<< loads << “7 

COUE =<< 7 \n <<. ae 

cout << “\nNumber of instructions processed: " << Ctrtl <<) 7 
cout << "\nPress any key to continue: "“"; 


Cina sconce. 
he 
ctrt++,; 
infile.seekg(ctr*sizeof(inst),ios::beg); 
} 
infile.close(); 
cout << “"\nTotal number of instruction records in file is: " << Ctr=i-<4) ae 


} 


void check_reg_dep(dependency& reg, int& count, int& ref)
{
    void depcalc(dependency& reg, int& location, int& ref);
    if (reg.ld_used == 1)
    {
        depcalc(reg, count, ref);
    };
}


// ------- this function calculates register usage data
void regcalc(reg_data& reg, int& count)
{
    reg.lex_dist = count - reg.last_use;
    reg.last_use = count;
    reg.tot_dist += reg.lex_dist;
    reg.avg_dist = float(reg.tot_dist)/float(reg.source + reg.dest);
}


// ------- this function calculates register dependency data
void depcalc(dependency& reg, int& location, int& ref)
{
    if (reg.ld_used == 1)
    {
        if (ref == 1)
        {
            reg.last_write = location;
            reg.load_count++;
        }
        else
        {
            if (reg.last_write != -1)
            {
                reg.ref_count++;
                reg.ref_dist = location - reg.last_write;
                reg.tot_dist += reg.ref_dist;
                reg.avg_dep_dist = float(reg.tot_dist)/float(reg.ref_count);
                reg.ld_used = 0;
                reg.last_write = -1;
            };
        };
    };
}
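The load-dependency bookkeeping above can be exercised in isolation. The following is a minimal sketch, assuming the field names that appear in the listing (ld_used, last_write, load_count, ref_count, ref_dist, tot_dist, avg_dep_dist). The struct name, the default initial values, and the helper mark_load_target are illustrative assumptions.

```cpp
#include <cassert>

// Per-register dependency record (hypothetical stand-in for the
// "dependency" type used by depcalc in the listing).
struct Dependency {
    int   ld_used      = 0;    // 1 while a load result is waiting to be used
    int   last_write   = -1;   // instruction index of the pending load
    int   load_count   = 0;    // loads that targeted this register
    int   ref_count    = 0;    // first uses of a loaded value
    int   ref_dist     = 0;    // distance of the most recent first use
    int   tot_dist     = 0;    // sum of all first-use distances
    float avg_dep_dist = 0.0f; // tot_dist / ref_count
};

// ref == 1: a load writes the register at 'location'.
// ref == 0: the register is read at 'location'.
inline void record_reference(Dependency& reg, int location, int ref) {
    if (reg.ld_used != 1) return;          // no load outstanding
    if (ref == 1) {
        reg.last_write = location;
        reg.load_count++;
    } else if (reg.last_write != -1) {
        reg.ref_count++;
        reg.ref_dist = location - reg.last_write;
        reg.tot_dist += reg.ref_dist;
        reg.avg_dep_dist = float(reg.tot_dist) / float(reg.ref_count);
        reg.ld_used = 0;                   // only the first use is counted
        reg.last_write = -1;
    }
}

// Convenience wrapper: flag the register as a load target, then record it.
inline void mark_load_target(Dependency& reg, int location) {
    reg.ld_used = 1;
    record_reference(reg, location, 1);
}
```

A load at instruction 10 followed by a first read at instruction 13 yields a dependency distance of 3; later reads of the same value are ignored until the next load.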
// ------- this function clears the screen
void clear_screen(int& i)
{
    for (i = 1; i <= 26; ++i)
        cout << '\n';
}
// --------- this function accepts input from user to guide simulation
void start_up(char *ifile, char *ofile)
{
    cout << "\nEnter address trace input file: ";
    cin >> ifile;
    cout << "\nEnter assembly code output file: ";
    cin >> ofile;
    cout << "\nCounting instructions...please wait..." << '\n';
}


// ------ this function counts the total number of instructions in file
void instr_count(char *afile, Instruction& instr, int& icount)
{
    fstream ifile;
    icount = 0;
    ifile.open(afile,ios::in|ios::nocreate);
    ifile.seekg(icount*sizeof(instr),ios::beg);
    while (ifile.read((char *)(&instr), sizeof(instr))) {
        icount++;
        ifile.seekg(icount*sizeof(instr),ios::beg);
        // if ((icount+1) % 1000 == 0)
        //     cout << '\n' << icount+1 << " records counted" << '\n';
    }
    ifile.close();
    cout << "\nTotal number of instruction records in file is: ";
    cout << icount << '\n';
}
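The record count above is obtained by reading the file record by record. Because the records are fixed-size, the same number can be derived from the file length. The following is a minimal sketch of that alternative in standard C++ (the `ios::nocreate` flag in the listing is a pre-standard iostream extension; a plain input stream already fails on a missing file). The function name and the -1 error convention are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>

// Returns the number of whole fixed-size records in a binary file,
// or -1 if the file cannot be opened or record_size is zero.
inline long count_records(const std::string& path, std::size_t record_size) {
    if (record_size == 0) return -1;
    // Open at the end so tellg() reports the file length in bytes.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    std::streamoff bytes = in.tellg();
    return static_cast<long>(bytes / static_cast<std::streamoff>(record_size));
}
```

A call such as `count_records(trace_path, sizeof(Instruction))` would replace the read loop, assuming the trace contains no partial trailing record.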


APPENDIX D. RISC CACHE INTERFACE SIMULATOR (RICIS) CODE 


-- ***************************************************************************
-- Thesis Project :RISC Cache Interface Simulator (RICIS)
-- Author         :Leonard Tharpe
-- Date           :September 1992
-- System         :UNIX
-- Compiler       :VERDIX Ada
-- Description    :This program is a simulation of a lockup-free
-- cache interface. It simulates the fetching and execution of a program
-- trace. The trace input files are generated by the SPARC Address Trace
-- Transformer/Analyzer (SATTA) program. This program uses a generic queue
-- package, along with random number generator and hexadecimal-to-decimal
-- conversion packages.
-- ******************************* NOTICE ************************************
-- Out-of-Order fetching/execution is "partially" implemented. This feature
-- uses the Dependent Instruction Queue (DIQ). All references to DIQ apply
-- to this feature.
-- ***************************************************************************


with TEXT_IO, QUEUES, RANDOM, HEX;
use TEXT_IO, RANDOM, HEX;

procedure SPLIT is

   package FLOAT_INOUT is new FLOAT_IO(FLOAT);
   use FLOAT_INOUT;
   package INTEGER_INOUT is new INTEGER_IO(INTEGER);
   use INTEGER_INOUT;


   -- This array defines the status of the registers used by the
   -- trace instructions. TRUE means the register is blocked and cannot
   -- be used, FALSE means the register is ready for use.
   type REGISTER_STATUS is array (0..31) of BOOLEAN;

   -- The following is the format of the instruction from the address
   -- trace used by this program. CODE indicates what type of instruction,
   -- ADDRESS indicates the address from which the instruction is taken, or
   -- to where the data is to be stored or retrieved from.
   type TRACE_RECORD is
      record
         CODE             :CHARACTER := ' ';
         ADDRESS          :STRING(1..8) := (others => ' ');
         SOURCE1_REGISTER :STRING(1..2) := (others => ' ');
         SOURCE2_REGISTER :STRING(1..2) := (others => ' ');
         TARGET_REGISTER  :STRING(1..2) := (others => ' ');
      end record;


   -- This record defines the format for entries in the Memory Access
   -- Queue (MAQ). The TRACE_LINE is the trace instruction, and the
   -- TIME_VALUE holds the priority assigned to each MAQ entry.
   type MAQ_RECORD is
      record
         TRACE_LINE :TRACE_RECORD;
         TIME_VALUE :INTEGER;
      end record;
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-- This record defines the entries to the Priority Event Queue (PEQ). 


   type EVENT_RECORD is
      record
         EVENT_ID :STRING(1..2) := (others => ' ');
         PRIORITY :INTEGER := 0;
      end record;



   -- Memory Access Queue
   package MAQ is new QUEUES(ITEM => MAQ_RECORD);
   use MAQ;

   -- Dependent Instruction Queue
   package DIQ is new QUEUES(ITEM => MAQ_RECORD);
   use DIQ;

   -- Priority Event Queue
   package PEQ is new QUEUES(ITEM => EVENT_RECORD);
   use PEQ;

   -- Auxiliary MAQ for viewing contents of the MAQ
   package MAQ_VIEW is new QUEUES(ITEM => MAQ_RECORD);
   use MAQ_VIEW;

   -- Auxiliary PEQ for viewing contents of the PEQ
   package PEQ_VIEW is new QUEUES(ITEM => EVENT_RECORD);
   use PEQ_VIEW;

   -- MAQ for load instructions in a split-queue configuration
   package Q0 is new QUEUES(ITEM => MAQ_RECORD);
   use Q0;

   -- MAQ for store instructions in a split-queue configuration
   package Q1 is new QUEUES(ITEM => MAQ_RECORD);
   use Q1;


   -- ********************** variable declarations **********************

   ANOTHER          :CHARACTER := 'y';   -- used to perform another simulation without
                                         -- re-running program
   DIQ_FETCHED      :BOOLEAN := FALSE;   -- a flag that indicates if an instruction was
                                         -- taken from the DIQ
   DIQ_SIZE         :POSITIVE := 350;    -- the pre-set maximum size of the DIQ
   DIQ_USED         :CHARACTER := 'n';   -- a flag that indicates if the DIQ was used in
                                         -- the simulation instead of stalling when an
                                         -- instruction depends on a queued request
   BLANK            :STRING(1..2) := (others => ' ');  -- used for checking a field to
                                         -- see if a register is used
   BLOCKED          :BOOLEAN := FALSE;   -- a flag that indicates if an instruction must
                                         -- be blocked for dependency
   BLOCKED_REGISTER :REGISTER_STATUS := (others => FALSE);  -- an array of 32 flags that
                                         -- indicates whether a register is blocked
   BLOCKS           :DIQ.QUEUE(DIQ_SIZE);  -- the name of the entries of the DIQ
   BR_CPI           :INTEGER := 3;       -- the number of cycles required for a branch
                                         -- instruction
   EVENT            :EVENT_RECORD;       -- a temporary store for a PEQ record
   EVENTS           :POSITIVE := 100;    -- the pre-set maximum size of a PEQ
   EXECUTE_TIME     :INTEGER := 0;       -- the cumulative execution time (in cycles) of
                                         -- a simulation session
   FILE_NAME        :STRING(1..30) := (others => ' ');
   FINISHED         :BOOLEAN := FALSE;   -- this flag indicates that both the instruction
                                         -- file and PEQ are empty
   HIT_RATE         :FLOAT := 90.0;      -- % the percentage of load instructions that are
                                         -- cache hits
   HOLD_VALUE       :INTEGER := 0;       -- temporary storage for queue position counter
   INPUT_FILE       :FILE_TYPE;          -- trace file from SATTA
   INTERVAL         :INTEGER := 0;       -- the instruction count intervals in which to
                                         -- obtain interim results
   -- LATENCY is the penalty assessed to read misses and writes
   LATENCY          :INTEGER := 50;      -- preset main memory access penalty in cycles
   LOAD_DEP         :INTEGER := 1;       -- this is the register number of an instruction
                                         -- that immediately follows a load instruction and
                                         -- references the destination register of the load
   LOAD_REG         :INTEGER := 1;       -- this is the destination register of a load
                                         -- instruction, cache hit or miss.
   LOAD_SWITCH      :BOOLEAN := FALSE;   -- this flag indicates that the issued instruction
                                         -- is a load and the next instruction's source
                                         -- register(s) must be checked for load dependency
   MAIN_MEMORY      :MAQ_RECORD;         -- store for the main memory simulation. Since
                                         -- only one instruction can be in main memory at
                                         -- any one time, only one MAQ record size is needed
   MAQ_SIZE         :POSITIVE := 100;    -- preset maximum size of the MAQ
   MAQ_COUNT        :MAQ_VIEW.QUEUE(MAQ_SIZE);  -- entry names of a temporary MAQ used to
                                         -- view contents
   MAQ_FULL         :BOOLEAN := FALSE;   -- flag indicating that MAQ is full
   MAQ_LENGTH       :INTEGER := 0;       -- current size of the MAQ.
   NAME_LENGTH      :INTEGER := 0;       -- length of filename entered by user. Derived
                                         -- from internal function
   PEQ_COUNT        :PEQ_VIEW.QUEUE(PEQ_SIZE);  -- entry names of temporary PEQ to view
                                         -- contents of the PEQ
   PEQ_SIZE         :POSITIVE := 200;    -- the pre-set maximum size of the PEQ
   Q_READS          :INTEGER := 0;       -- the number of read or load instructions in the MAQ
   Q_WRITES         :INTEGER := 0;       -- the number of write or store instructions in the MAQ
   Q0_SIZE          :POSITIVE := 20;     -- the pre-set maximum size of the Q0
   Q1_SIZE          :POSITIVE := 20;     -- the pre-set maximum size of the Q1
   READ_PRI         :INTEGER := 0;       -- the priority value assigned to read or load misses
   READS            :Q0.QUEUE(Q0_SIZE);  -- the name of the elements in Q0
   RECORD_COUNTER   :INTEGER := 0;       -- accumulates the number of instructions read from
                                         -- address trace file
   RECORD_REMOVED   :MAQ_RECORD;         -- temporary storage for elements removed from the MAQ
   REQUESTS         :MAQ.QUEUE(MAQ_SIZE);
   RESPONSE         :CHARACTER := 'y';   -- used to get user yes/no response
   SAME_DATA        :CHARACTER := 'n';   -- for using the same parameters of a previous session
   SEPERATE_Q       :CHARACTER := 'n';   -- value that determines whether to use a single or
                                         -- separate MAQ scheme
   TOTAL_CYCLES     :INTEGER := 0;       -- accumulates the total number of cycles of a session
   TOTAL_PENALTY    :INTEGER := 0;       -- accumulates the total cost of main memory access
                                         -- during a session
   TRACE            :MAQ_RECORD;         -- holds data in the format of an MAQ entry
   TRACE_REC        :TRACE_RECORD;       -- formatted store of line from the address trace file
   TYPE_Q           :CHARACTER := 'f';   -- the MAQ scheme to simulate: 'f' - FIFO, 'p' - Priority
   UPDATE_RECORD    :MAQ_RECORD;         -- temporary store for MAQ entry
   VIEW             :CHARACTER := 'n';   -- value that determines whether to show cycle-by-cycle
                                         -- simulation
   WAITING          :PEQ.QUEUE(EVENTS);  -- the name of the elements in the PEQ
   WRITE_PRI        :INTEGER := 0;       -- the priority value assigned to write or store
                                         -- instructions
   WRITES           :Q1.QUEUE(Q1_SIZE);  -- the name of the elements in Q1
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—-— This procedure clears the CRT screen 
   procedure CLEARSCREEN is
   begin
      PUT(ASCII.ESC);
      PUT("[2J");
   end CLEARSCREEN;

   -- This procedure reads in file name and opens the input file
   procedure GET_INPUT_FILE(INPUT_FILE :in out FILE_TYPE) is
      FILE_NAME   :STRING(1..30);
      NAME_LENGTH :INTEGER;
   begin
      PUT_LINE("Enter name of file to parse ");
      GET_LINE(FILE_NAME, NAME_LENGTH);
      OPEN(INPUT_FILE, MODE => IN_FILE, NAME => FILE_NAME(1..NAME_LENGTH));
      -- The simulator is initiated by inserting the first event into
      -- the PEQ. The first event being to issue an instruction (ii).
      PEQ.CLEAR(WAITING);
      EVENT.EVENT_ID := "ii";
      EVENT.PRIORITY := 0;
      PEQ.ADD(EVENT, WAITING);
   end GET_INPUT_FILE;
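The simulator is driven by the Priority Event Queue, which is seeded with a single "issue instruction" event ("ii") at time 0. The following is a minimal C++ sketch of that idea, assuming that the PRIORITY field is a time stamp and that the earliest event is served first. It uses `std::priority_queue` as a stand-in for the generic QUEUES package, whose actual ordering rules are not shown in this listing; the type and function names are illustrative.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// One pending simulation event: a two-letter identifier and its time.
struct Event {
    std::string id;
    int time;
};

// Orders events so that the smallest time is at the top of the queue.
struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const {
        return a.time > b.time;
    }
};

using EventQueue = std::priority_queue<Event, std::vector<Event>, LaterFirst>;

// Builds a queue holding only the initial "issue instruction" event.
inline EventQueue make_seeded_queue() {
    EventQueue q;
    q.push({"ii", 0});
    return q;
}

// Removes and returns the earliest pending event (queue must not be empty).
inline Event next_event(EventQueue& q) {
    assert(!q.empty());
    Event e = q.top();
    q.pop();
    return e;
}
```

A main loop would repeatedly call `next_event`, act on the event, and push any follow-up events with a later time.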


   -- This procedure allows the user the option of viewing the cycle-
   -- by-cycle transactions of the simulation or to execute
   -- without displaying transactions. It also allows the user
   -- to view interim results.
   procedure GET_VIEW_METHOD is
   begin
      NEW_LINE;
      PUT("Do you want to view Queues after every activity ? (y/n).. ");
      GET(VIEW);
      NEW_LINE;
      PUT("Enter interval value for viewing Queues: ");
      GET(INTERVAL);
      NEW_LINE;
   end GET_VIEW_METHOD;


   -- This procedure lets the user set the parameters for the
   -- target configuration.
   procedure GET_INITIAL_DATA is
   begin
      NEW_LINE;
      PUT("Enter simulated CACHE HIT RATE: ");
      GET(HIT_RATE);
      NEW_LINE;
      PUT("Simulate FIFO or Priority Queue? (f/p) : ");
      GET(TYPE_Q);
      NEW_LINE;
      PUT("Enter Memory Queue Size: ");
      GET(MAQ_SIZE);
      NEW_LINE;
      if TYPE_Q = 'p' then
         PUT("Use separate memory queues for Reads and Writes? (y/n) : ");
         GET(SEPERATE_Q);
         if SEPERATE_Q = 'y' then
            PUT("ENTER READ QUEUE SIZE: ");
            GET(Q0_SIZE);
            NEW_LINE;
            PUT("ENTER WRITE QUEUE SIZE: ");
            GET(Q1_SIZE);
            NEW_LINE;
         end if;
         PUT("Enter Read Priority (0 to 1) : ");
         GET(READ_PRI);
         NEW_LINE;
         PUT("Enter Write Priority (0 to 1) : ");
         GET(WRITE_PRI);
         NEW_LINE;
      end if;
      NEW_LINE;
      PUT("Do you want to use Dependent Instruction Queue? (y/n) : ");
      GET(DIQ_USED);
      NEW_LINE;
      PUT("Enter number of stall cycles for Load dependency: ");
      GET(LOAD_DEP);
      GET_VIEW_METHOD;
      MAQ.CLEAR(REQUESTS);
   end GET_INITIAL_DATA;


   -- This procedure parses the lines from the address trace file. Each
   -- line represents an instruction. The instruction is broken down
   -- into components.
   procedure GET_FIELDS(PARSE_LINE          :in out STRING;
                        NR_OF_CHARS_IN_LINE :in out INTEGER;
                        TRACE_REC           :in out TRACE_RECORD) is
      VALID_ADDRESS,
      VALID_CODE :BOOLEAN := FALSE;
   begin
      TRACE_REC.CODE := PARSE_LINE(3);
      TRACE_REC.ADDRESS := PARSE_LINE(6..13);
      TRACE_REC.SOURCE1_REGISTER := PARSE_LINE(16..17);
      TRACE_REC.SOURCE2_REGISTER := PARSE_LINE(20..21);
      TRACE_REC.TARGET_REGISTER := PARSE_LINE(24..25);
   end GET_FIELDS;

   -- Parses the lines read in from the input file.
   procedure DO_LINE_PARSING(INPUT_FILE :in out FILE_TYPE;
                             TRACE_REC  :in out TRACE_RECORD) is
      PARSE_LINE          :STRING(1..80) := (others => ' ');
      NR_OF_CHARS_IN_LINE :INTEGER;
   begin
      GET_LINE(INPUT_FILE, PARSE_LINE, NR_OF_CHARS_IN_LINE);
      GET_FIELDS(PARSE_LINE, NR_OF_CHARS_IN_LINE, TRACE_REC);
      TRACE.TRACE_LINE := TRACE_REC;
   end DO_LINE_PARSING;
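The fixed-column slicing performed by GET_FIELDS can be illustrated outside of Ada. The following is a minimal C++ sketch, assuming the column positions given in the listing (code in column 3, address in columns 6..13, and the three register fields in columns 16..17, 20..21 and 24..25, all 1-based). The struct and function names, and the sample line in the usage note, are illustrative assumptions.

```cpp
#include <cassert>
#include <string>

// One parsed trace line (hypothetical C++ counterpart of TRACE_RECORD).
struct TraceRecord {
    char code = ' ';
    std::string address;
    std::string source1;
    std::string source2;
    std::string target;
};

// Slices a trace line by fixed columns. Ada positions are 1-based, so
// each offset below is the Ada column minus one.
// Returns false if the line is too short to hold all fields.
inline bool parse_trace_line(const std::string& line, TraceRecord& rec) {
    if (line.size() < 25) return false;
    rec.code    = line[2];             // PARSE_LINE(3)
    rec.address = line.substr(5, 8);   // PARSE_LINE(6..13)
    rec.source1 = line.substr(15, 2);  // PARSE_LINE(16..17)
    rec.source2 = line.substr(19, 2);  // PARSE_LINE(20..21)
    rec.target  = line.substr(23, 2);  // PARSE_LINE(24..25)
    return true;
}
```

Under these assumptions, a line such as `"  2  0000a1b4  01  02  03"` yields code '2', address "0000a1b4", and register fields "01", "02" and "03".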


-—- This procedure is a viewing option for the users, allowing the 
—-- viewing of the status of each register. 
procedure VIEW _REGISTER_STATUS is 


COL : INTEGER; 
begin 
PUT (" =2=====23- = Sa a re ") 
NEW LINE; 
PUTT 20 1 Z 3 4 5 6 Z 8 9 10 11 12 13 14 i153 
NEW LINE; 
fOr elim Oe) oS eloor 
if BLOCKED REGISTER(I) then 
PULP = oe 
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else 
PUT (" oe 
ena if; 
end loop; 
NEW LINE; 


NEW LINE; 


Pimp te 77 38 19 20 21 22 23 24 25 26 27 28 29 30 31 '/"); 


NEW LINE; 
Meer in 16..31 Loop 
if BLOCKED REGISTER(I) then 
mo," ox") s 
else 
eur (' ag ; 
end if; 
end loop; 
      NEW_LINE;
   end VIEW_REGISTER_STATUS;

   -- This procedure allows the viewing of the contents of the MAQ.
   procedure VIEW_MAQ is
      LENGTH   :INTEGER;
      MAQ_HOLD :MAQ_RECORD;
      VIEW     :CHARACTER := 'n';
      KEY      :CHARACTER;
   begin


Raa RRR RRA A RAK VIEW OUTSTANDING MEMORY REQUEST QUEUE 


NEW LINE; 
LENGTH := MAQ.LENGTH_ OF (REQUESTS) ; 
Peewee ATE QO = ’n’ then 

SET COL(20); 


PUT LINE ("#4 #AAKHHAKAHAAAMAO HH AHHH AAA RA HHA AY 


SeimeCOL (20); 


PUT LINE("* CODE ADDRESS PRIORITY* "); 
SET COL (20); 
PUT LINE ("*-------------------------------- ar) 


PUT ("MAIN MEMORY -—>"); 

moerer in i..LENGTH loop 
MAQ. REMOVE (MAQ_ HOLD, REQUESTS) ; 
SET _COL(20); 


PUT("* jc, 

PUT(MAQ HOLD.TRACE LINE.CODE); 

PUT ( ee | s 

PUT (MAQ HOLD.TRACE LINE.ADDRESS) ; 
PUT (" a 2 

PUT(MAQ HOLD.TIME VALUE, WIDTH => 3); 
PUT LINE(" ett) 


MAQ.ADD(MAQ HOLD, REQUESTS) ; 
Shame OlL (20) ; 
PUT LINE ("**#RHHHKKKHHHHEH HHH HHH HHH HH KHAAAKRKN) ; 
end loop; 
NEW LINE; 
else 
BEE COL(20) ; 
PUT LINE ("******* MAIN MEMORY ***++*+**") ; 
SETUCOL (20) ; 




KAKA KKKKEAKRAKKEKEAKAKKH 


PUT _LINE("* CODE ADDRESS *"); 
SET COL(20); 


PUT LINE ("*-----~~-----~-----~~-~------ sy 
SET COL( 20), 

PUT ae 

PUT (MAIN MEMORY.TRACE_ LINE.CODE); 
PUT. we 

PUT (MAIN MEMORY. TRACE _LINE.ADDRESS) ; 
PUT LINE (|) ees “ay 


SET) COL (20) 

PUT LINE (UR RRA RAH ARAHRARAAHARAKAKAAHREN) ; 

NEW LINE; 

SET COL (20) + 

PUT LINE ("*****"nn4% READ QUEUE ******AeAeee4eN) 
SET COL(20); 


PUT _LINE("* CODE ADDRESS PRIORITY= jy 
SET COL(20); 
PUT LINE ("*-------------------------------- amt) 


for I in 1..(Q0.LENGTH OF (READS)) loop 
Q0.REMOVE (MAQ_ HOLD, READS) ; 
SET VeCOhd 20) 
PUT("* oe 
PUT (MAQ HOLD.TRACE LINE.CODE) ; 
PUT (s a 
PUT (MAQ_ HOLD.TRACE_LINE.ADDRESS) ; 
PUT(" ”) ; 
PUT (MAQ HOLD.TIME VALUE, WIDTH => 3); 
PUT EINE ( s "4 sue 
SET COLY{20) ¢ : 
PUT LINE ('** #4 R AH HAAHHHHHKAAKHRHKAKAKKEAAH KHAKI) Ai 
Q0.ADD(MAQ HOLD, READS) ; 
end loop; 
if QO.IS EMPTY(READS) then 
Shmsconrc 0), 
PUT LINE("* (empty) ae 
SET COL(20); 
PUT LINE ("**HAAHKHHH HAH KH HHH HAHAHAHA HAHAHAHA) i 
end if; 
NEW LINE; 
SETeCGOL (20); 
PUT LINE ("****44444% WRITE QUEUE Aa KAAKAKAL AHN)» 
SET COL(20); 


PUT_LINE("* CODE ADDRESS PRIORITY* "); 
SET COL (20); 
PUT LL INE ("42S “my 


for I in 1..(Q1.LENGTH_OF(WRITES)) loop 
Q1.REMOVE (MAQ_ HOLD, WRITES); 
SET _COL(20) ; 


PUT ( 08 ot aN) ; 

PUT (MAQ HOLD. TRACE LINE.CODE); 

PUT G. ee 

PUT (MAQ HOLD.TRACE LINE.ADDRESS) ; 
PUT ss ; 

PUT(MAQ HOLD.TIME VALUE, WIDTH => 3); 
PT ai Tie ee ac ti 


SET GOEL 20)7 
PUT LINE (U*AAAARAAAAAHAHHHAHEHAKEHKHAAHRHAKAKHAAINY | 
Q1.ADD(MAQ HOLD, WRITES) ; 

end loop; 

if Q1.IS EMPTY(WRITES) then 
SET COL (20); 
PUT Dene (= (empty) wi) 
SET COL( 207 
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PUT LINE (#4 HHH HAHAHAHA HR KHAHKAHKHE ARH AAKHAAHAN YS » 


end if; 
NEW LINE (2); 
end if; 
end VIEW MAQ; 
-—— This procedure allows the viewing of the contents of the DIQ 
procedure VIEW DIQ is 


LENGTH INTEGER; 

DIQ HOLD :MAQ RECORD; 

VIEW CHARACTER = “4n’ 5 

KEY. >: CHARACTER; 

begin 
— HK HK KHKKKHKHKKAKKHH HK VIEW Dependent INSTRUCTION QUEUE aAwKeAKKK KKK KKH HKEHKAKKAAKAH 

NEW LINE; 
DENGTH := DIQ. LENGTH _OF (BLOCKS) > 


Set COL(20); 


PUT manera RA ANRA AAT) TORK RAAARAARAHKAAARAE AAT ) ; 


BPETEGCL (20) ; 


PUT LINE(“* CODE ADDRESS aie en SLRS! eA) 7 
SET COL(20); 
PUT LINE ("*--------------~---------------------------- aM) 


for I in 1..LENGTH loop 


DIQ.REMOVE (DIQ HOLD, BLOCKS) ; 
Se COL(20); 


PUT( th a0) ; 

PUT (DIQ_ HOLD.TRACE LINE.CODE) ; 

PUT ( nt) ; 

PUT (DIQ HOLD .TRACE LINE.ADDRESS) ; 

PUT ( i) ; i : 

i210 0§ (DIQ HOLD. LTT VALUE Wile th => Sj 
Boe CO"); 


if DIQ HOLD.TRACE LINE.SOURCE1 REGISTER /= BLANK then 


PUT (HEX_TO_INTEGER(DIQ_HOLD.TRACE_LINE.SOURCE1_REGISTER) , WIDTH 


else 

PUT ( te y Z 
end if; 
PUT ( te i) 


if DIQ HOLD.TRACE LINE.SOURCE2 REGISTER /= BLANK’ then 


PUT (HEX_TO_INTEGER(DIQ_HOLD.TRACE_LINE.SOURCE2_REGISTER) , WIDTH 


else 

POUT (" ) i. 
ena if; 
ReteLINE(" * "); 
DIQ.ADD(DIQ HOLD, BLOCKS) ; 
SET COL(20); 


PUT LINE ( URAAAHAKAKHAHAAHKRANANAAKAAKKHKAHKKHAKKHEKAAKAKAAKAR J & 


end loop; 
NEW LINE; 


end VIEW _DIQ; 


Se, 


=> 2); 
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-—-— This procedure allow the viewing of the PEQ contents. 
procedure VIEW PEQ is 


LENGTH >; INTEGER; 
PEQ HOLD :EVENT_ RECORD; 
) 
begin 
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SET COb(2o) 
PUT LINE ("*****#**# te PEQ ***teeeeee aN) - 
SERSCODZO) 


PUT LINE("* EVENT TIME *"); 
SET COL(20); 

PUT LINE("* aaa aa ee ee an) 
LENGTH := PEQ.LENGTH OF (WAITING) ; 


if not PEQ.IS EMPTY(WAITING) then 
for I in 1..LENGTH loop 
PEQ. REMOVE (PEQ HOLD, WAITING) ; 
SET sGOL(20)- 
PUT (= =) Ff 
PUT(PEQ HOLD.EVENT ID); 
PUT (" ae 
PUT(PEQ HOLD.PRIORITY, WIDTH => 5); 
PUT LINE(" Ags 
SET 160i. (20); 
PUT LINE ("*AHAA AHHH HAKHKAERKHAAAE HH KN ; 
PEQ.ADD(PEQ HOLD, WAITING) ; 
end loop; 
end if; 
end VIEW PEQ; 


   -- This procedure controls the viewing of all queues and provides
   -- interim results.
   procedure VIEW_RESULTS is
      VIEW      :CHARACTER := 'n';
      CPI_VALUE :FLOAT;
   begin
      NEW_LINE;
      PUT("INSTRUCTION COUNT: ");
      PUT(RECORD_COUNTER, WIDTH => 1);
      NEW_LINE;
      PUT("PROGRAM EXECUTION TIME IN CYCLES: ");
      PUT(EXECUTE_TIME, WIDTH => 1);
      NEW_LINE;
      CPI_VALUE := FLOAT(EXECUTE_TIME)/FLOAT(RECORD_COUNTER);
      PUT("CURRENT CPI VALUE: ");
      PUT(CPI_VALUE, FORE => 3, AFT => 2, EXP => 0);
      NEW_LINE;
   end VIEW_RESULTS;
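The interim CPI reported above is simply the cumulative execution time in cycles divided by the number of trace records processed. The following is a minimal C++ sketch of that calculation. The function name and the choice to return 0.0 when no instruction has been processed are assumptions made for illustration; the Ada procedure performs the division without such a guard.

```cpp
#include <cassert>

// Cycles per instruction: total cycles divided by instructions processed.
// Returns 0.0 when no instruction has been counted yet, to avoid a
// division by zero.
inline double cycles_per_instruction(long execute_time, long record_counter) {
    if (record_counter <= 0) return 0.0;
    return static_cast<double>(execute_time) /
           static_cast<double>(record_counter);
}
```

For instance, 150 cycles spent on 100 instructions gives a CPI of 1.5.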


   -- This procedure displays the contents of each address trace line (record)
   procedure VIEW_TRACE_LINE(TRACE :in MAQ_RECORD) is
   begin
      if VIEW = 'y' then
         PUT(" INSTRUCTION FETCH: ");
         PUT(TRACE.TRACE_LINE.CODE); PUT(" ");
         PUT(TRACE.TRACE_LINE.ADDRESS);
         NEW_LINE;
      end if;
   end VIEW_TRACE_LINE;


   -- -----------------------------------------------------------------------
   -- This procedure puts entries into the MAQ
   procedure ENTER_MAQ(TRACE :in out MAQ_RECORD) is
   begin
      -- PUT_LINE("ENQUEUING MAQ");
      MAQ.ADD(TRACE, REQUESTS);
      -- if the entry is a load instruction, the destination register is
      -- marked as blocked.
      if TRACE.TRACE_LINE.CODE = '0' then
         BLOCKED_REGISTER(HEX_TO_INTEGER(TRACE.TRACE_LINE.TARGET_REGISTER)) := TRUE;
         Q_READS := Q_READS + 1;
      else
         Q_WRITES := Q_WRITES + 1;
      end if;
   end ENTER_MAQ;
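The blocking scheme used here is a small register scoreboard: a queued load (code '0') marks its target register as blocked, and an instruction may issue only if none of its source registers is blocked. The following is a minimal C++ sketch of that mechanism, assuming 32 registers and two-character hexadecimal register fields with a blank field meaning "no register". The class and function names are illustrative and do not appear in the thesis code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Converts a two-character hex register field such as "0a" to 10.
// Returns -1 for a blank field (no register in that position).
inline int hex_field_to_int(const std::string& field) {
    if (field.find_first_not_of(' ') == std::string::npos) return -1;
    return std::stoi(field, nullptr, 16);
}

// Tracks which of the 32 registers are waiting on an outstanding load.
class RegisterScoreboard {
public:
    void block(int reg)   { blocked_.at(static_cast<std::size_t>(reg)) = true; }
    void unblock(int reg) { blocked_.at(static_cast<std::size_t>(reg)) = false; }
    bool is_blocked(int reg) const {
        return blocked_.at(static_cast<std::size_t>(reg));
    }
    // An instruction can issue when neither source register is blocked.
    bool can_issue(const std::string& src1, const std::string& src2) const {
        return !field_blocked(src1) && !field_blocked(src2);
    }
private:
    bool field_blocked(const std::string& field) const {
        const int reg = hex_field_to_int(field);
        return reg >= 0 && is_blocked(reg);
    }
    std::array<bool, 32> blocked_{};  // all registers start unblocked
};
```

In this sketch, enqueueing a load would call `block` on its target register and serving the request from main memory would call `unblock`.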


   -- This procedure puts entries into the LOAD queue
   procedure ENTER_Q0(TRACE :in out MAQ_RECORD) is
   begin
      -- PUT_LINE("ENQUEUING 'read Q' ");
      Q0.ADD(TRACE, READS);
      BLOCKED_REGISTER(HEX_TO_INTEGER(TRACE.TRACE_LINE.TARGET_REGISTER)) := TRUE;
      Q_READS := Q_READS + 1;
   end ENTER_Q0;

   -- This procedure puts elements into the STORE queue
   procedure ENTER_Q1(TRACE :in out MAQ_RECORD) is
   begin
      -- PUT_LINE("ENQUEUING 'write Q' ");
      Q1.ADD(TRACE, WRITES);
      Q_WRITES := Q_WRITES + 1;
   end ENTER_Q1;

   -- This procedure puts entries into the PEQ
   procedure ENTER_PEQ(EVENT :in out EVENT_RECORD) is
   begin
      PEQ.ADD(EVENT, WAITING);
   end ENTER_PEQ;

   -- This procedure puts elements into the DIQ
   procedure ENTER_DIQ(INSTRUCTION :in out MAQ_RECORD) is
   begin
      -- PUT_LINE("ENQUEUING DIQ");
      DIQ.ADD(INSTRUCTION, BLOCKS);
      -- BLOCKED_REGISTER(HEX_TO_INTEGER(INSTRUCTION.TRACE_LINE.TARGET_REGISTER))
      --    := TRUE;
   end ENTER_DIQ;
   -- This procedure takes instructions from the MAQ
   procedure SERVE_MAQ(TRACE :in out MAQ_RECORD) is
      TARGET       :MAQ_RECORD;
      MAQ_HOLD     :MAQ_RECORD;
      Q_LENGTH     :INTEGER;
      HI_POSITION  :INTEGER := 0;
      MAQ_POSITION :INTEGER := 0;
      QUEUE_HEAD   :INTEGER := 999999;
      FOUND        :BOOLEAN := FALSE;
   begin
      MAQ.REMOVE(MAQ_HOLD, REQUESTS);        -- request leaving main memory
      Q_LENGTH := MAQ.LENGTH_OF(REQUESTS);   -- getting the length of the queue

      -- If the removed instruction is a load instruction, the destination
      -- is unblocked.
      if MAQ_HOLD.TRACE_LINE.CODE = '0' then
         BLOCKED_REGISTER(HEX_TO_INTEGER(MAQ_HOLD.TRACE_LINE.TARGET_REGISTER))
            := FALSE;
      end if;

      -- This statement is for determining the next item by priority to
      -- remove from the queue. The target item is identified by its
      -- position in the queue. Each item is removed, compared, and
      -- re-entered into the queue.
      if not MAQ.IS_EMPTY(REQUESTS) then
         for I in 1..Q_LENGTH loop
            HI_POSITION := HI_POSITION + 1;
            MAQ.REMOVE(MAQ_HOLD, REQUESTS);
            if MAQ_HOLD.TIME_VALUE < QUEUE_HEAD then
               QUEUE_HEAD := MAQ_HOLD.TIME_VALUE;
               MAQ_POSITION := HI_POSITION;
               TARGET := MAQ_HOLD;
            end if;
            MAQ.ADD(MAQ_HOLD, REQUESTS);
         end loop;
         HOLD_VALUE := TARGET.TIME_VALUE;
         if VIEW = 'y' then
            NEW_LINE;
            PUT("ENTERING MAIN MEMORY: ");
            PUT(TARGET.TRACE_LINE.CODE); PUT(" ");
            PUT(TARGET.TRACE_LINE.ADDRESS); PUT(" ");
            PUT(TARGET.TIME_VALUE, WIDTH => 1);
         end if;

         -- The value of MAQ_POSITION is the position in the queue of the
         -- item to be removed.
         HI_POSITION := 0;
         MAQ.ADD(TARGET, REQUESTS);
         for I in 1..Q_LENGTH loop
            MAQ.REMOVE(MAQ_HOLD, REQUESTS);
            HI_POSITION := HI_POSITION + 1;
            if MAQ_POSITION /= HI_POSITION then
               MAQ.ADD(MAQ_HOLD, REQUESTS);
            end if;
         end loop;

         -- Numerical data collection statements
         if TARGET.TRACE_LINE.CODE = '0' then
            Q_READS := Q_READS - 1;
            TOTAL_CYCLES := TOTAL_CYCLES + LATENCY;
         else
            Q_WRITES := Q_WRITES - 1;
            TOTAL_CYCLES := TOTAL_CYCLES + LATENCY;
         end if;
      end if;
   end SERVE_MAQ;

-- This procedure removes items from the DIQ. This is a FIFO
-- queue, thus the generic REMOVE function is used.

procedure SERVE_DIQ(TRACE :in out MAQ_RECORD) is
   TARGET   :MAQ_RECORD;
   DIQ_HOLD :MAQ_RECORD;
   Q_LENGTH :INTEGER;
begin
   if not DIQ.IS_EMPTY(BLOCKS) then
      DIQ.REMOVE(TARGET, BLOCKS);
      if TARGET.TRACE_LINE.SOURCE1_REGISTER = BLANK then
         if BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.SOURCE2_REGISTER))
               = FALSE then
            DIQ_FETCHED := TRUE;
            -- The removed instruction becomes the active instruction executed
            TRACE := TARGET;
            -- Unblocks the destination register
            BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.TARGET_REGISTER))
               := FALSE;
            if VIEW = 'y' then
               NEW_LINE;
               PUT_LINE("FETCHING FROM DIQ");
               PUT(TARGET.TRACE_LINE.CODE); PUT(" ");
               PUT(TARGET.TRACE_LINE.ADDRESS); PUT(" ");
               PUT(TARGET.TIME_VALUE,WIDTH => 1);
            end if;
         else
            -- Put removed item back into DIQ
            DIQ.ADD(TARGET, BLOCKS);
            DIQ_FETCHED := FALSE;
         end if;
      end if;
      if TARGET.TRACE_LINE.SOURCE2_REGISTER = BLANK then
         if BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.SOURCE1_REGISTER))
               = FALSE then
            DIQ_FETCHED := TRUE;
            -- The removed instruction becomes the active instruction executed
            TRACE := TARGET;
            -- Unblocks the destination register
            BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.TARGET_REGISTER))
               := FALSE;
            if VIEW = 'y' then
               NEW_LINE;
               PUT_LINE("FETCHING FROM DIQ");
               PUT(TARGET.TRACE_LINE.CODE); PUT(" ");
               PUT(TARGET.TRACE_LINE.ADDRESS); PUT(" ");
               PUT(TARGET.TIME_VALUE,WIDTH => 1);
            end if;
         else
            -- Put removed item back into DIQ
            DIQ.ADD(TARGET, BLOCKS);
            DIQ_FETCHED := FALSE;
         end if;
      end if;
      if TARGET.TRACE_LINE.SOURCE1_REGISTER /= BLANK and
         TARGET.TRACE_LINE.SOURCE2_REGISTER /= BLANK then
         if BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.SOURCE1_REGISTER))
               = FALSE
            and BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.SOURCE2_REGISTER))
               = FALSE then
            DIQ_FETCHED := TRUE;
            -- The removed instruction becomes the active instruction executed
            TRACE := TARGET;
            -- Unblocks the destination register
            BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.TARGET_REGISTER))
               := FALSE;
            if VIEW = 'y' then
               NEW_LINE;
               PUT_LINE("FETCHING FROM DIQ");
               PUT(TARGET.TRACE_LINE.CODE); PUT(" ");
               PUT(TARGET.TRACE_LINE.ADDRESS); PUT(" ");
               PUT(TARGET.TIME_VALUE,WIDTH => 1);
            end if;
         else
            -- Put removed item back into DIQ
            DIQ.ADD(TARGET, BLOCKS);
            DIQ_FETCHED := FALSE;
         end if;
      end if;
   end if;
end SERVE_DIQ;
-- This procedure removes entries from the LOAD MAQ. The items are
-- removed FIFO.
procedure SERVE_Q0(TRACE :in out MAQ_RECORD) is
   TARGET   :MAQ_RECORD;
   Q0_HOLD  :MAQ_RECORD;
   Q_LENGTH :INTEGER;
begin
   if not Q0.IS_EMPTY(READS) then
      Q0.REMOVE(TARGET, READS);
      if VIEW = 'y' then
         NEW_LINE;
         PUT_LINE("FETCHING FROM READ Q");
         PUT(TARGET.TRACE_LINE.CODE); PUT(" ");
         PUT(TARGET.TRACE_LINE.ADDRESS); PUT(" ");
         PUT(TARGET.TIME_VALUE,WIDTH => 1);
      end if;
      TRACE := TARGET;
      BLOCKED_REGISTER(HEX_TO_INTEGER(TARGET.TRACE_LINE.TARGET_REGISTER)) := FALSE;
   end if;
end SERVE_Q0;


-- This procedure removes entries from the STORE MAQ (FIFO).
procedure SERVE_Q1(TRACE :in out MAQ_RECORD) is
   TARGET   :MAQ_RECORD;
   Q1_HOLD  :MAQ_RECORD;
   Q_LENGTH :INTEGER;
begin
   if not Q1.IS_EMPTY(WRITES) then
      Q1.REMOVE(TARGET, WRITES);
      if VIEW = 'y' then
         NEW_LINE;
         PUT_LINE("FETCHING FROM WRITE Q");
         PUT(TARGET.TRACE_LINE.CODE); PUT(" ");
         PUT(TARGET.TRACE_LINE.ADDRESS); PUT(" ");
         PUT(TARGET.TIME_VALUE,WIDTH => 1);
      end if;
      TRACE := TARGET;
   end if;
end SERVE_Q1;


-- This procedure takes items from the PEQ. This is a priority
-- queue, so the item to be removed is identified by its priority
-- value.
procedure SERVE_PEQ(TARGET :in out EVENT_RECORD) is
   -- TARGET      :EVENT_RECORD;
   EVENT_HOLD     :EVENT_RECORD;
   HI_POSITION    :INTEGER := 0;
   EVENT_POSITION :INTEGER := 0;
   QUEUE_HEAD     :INTEGER := 999999;
   FOUND          :BOOLEAN := FALSE;
begin
   PEQ_VIEW.CLEAR(PEQ_COUNT);
   while not PEQ.IS_EMPTY(WAITING) loop
      HI_POSITION := HI_POSITION + 1;
      PEQ.REMOVE(EVENT_HOLD, WAITING);
      if EVENT_HOLD.PRIORITY < QUEUE_HEAD then
         QUEUE_HEAD := EVENT_HOLD.PRIORITY;
         EVENT_POSITION := HI_POSITION;
         TARGET := EVENT_HOLD;
      end if;
      PEQ_VIEW.ADD(EVENT_HOLD,PEQ_COUNT);
   end loop;

   HOLD_VALUE := TARGET.PRIORITY;

   if VIEW = 'y' then
      NEW_LINE;
      PUT("ITEM SERVICED : ");
      PUT(TARGET.EVENT_ID); PUT(" ");
      PUT(TARGET.PRIORITY,WIDTH => 1);
      NEW_LINE;
   end if;

   PEQ.CLEAR(WAITING);

   HI_POSITION := 0;
   while not PEQ_VIEW.IS_EMPTY(PEQ_COUNT) loop
      PEQ_VIEW.REMOVE(EVENT_HOLD, PEQ_COUNT);
      HI_POSITION := HI_POSITION + 1;
      if EVENT_POSITION /= HI_POSITION then
         PEQ.ADD(EVENT_HOLD, WAITING);
      end if;
   end loop;
end SERVE_PEQ;
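
The two-pass scan used by SERVE_PEQ (and, in a similar form, by SERVE_MAQ) can be sketched outside Ada as follows. This is only a minimal Python illustration, assuming a queue that offers nothing but FIFO add and remove; the function name `serve_min` and the `(event_id, priority)` tuple layout are hypothetical and do not appear in the thesis code. Where the listing uses the sentinel 999999 for QUEUE_HEAD, the sketch uses `None`.

```python
from collections import deque


def serve_min(queue):
    """Remove and return the entry with the lowest priority value.

    `queue` is a deque of (event_id, priority) tuples that is touched
    only through FIFO operations (append / popleft), mirroring the
    generic ADD / REMOVE queue package used in the listing.
    """
    if not queue:
        raise IndexError("serve_min called on an empty queue")

    view = deque()      # temporary holding queue, like PEQ_VIEW
    best = None
    best_pos = 0
    pos = 0

    # Pass 1: drain the queue, remembering the first minimum seen.
    # The strict '<' keeps the earliest entry when priorities tie.
    while queue:
        pos += 1
        item = queue.popleft()
        if best is None or item[1] < best[1]:
            best, best_pos = item, pos
        view.append(item)

    # Pass 2: refill the queue in the original order, skipping the
    # position of the selected entry.
    pos = 0
    while view:
        pos += 1
        item = view.popleft()
        if pos != best_pos:
            queue.append(item)

    return best


if __name__ == "__main__":
    events = deque([("ii", 7), ("lm", 3), ("lm", 9), ("ii", 3)])
    print(serve_min(events))   # the first entry with priority 3
    print(list(events))        # remaining entries, order preserved
```

Each call costs time proportional to the queue length, which is acceptable for the short queues simulated here; a heap would be the usual choice for long event lists.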

-- This procedure displays interim results of the simulation.

procedure INTERVAL_CHECK(INTERVAL   :in out INTEGER;
                         MAQ_LENGTH :in out INTEGER) is
begin
   if (RECORD_COUNTER mod INTERVAL) = 0 then
      NEW_LINE;
      MAQ_LENGTH := MAQ.LENGTH_OF(REQUESTS);
      PUT("Number of records processed: ");
      PUT(RECORD_COUNTER, WIDTH => 1);
      NEW_LINE;
      PUT("NUMBER OF RECORDS IN MAQ: ");
      if MAQ_LENGTH /= 0 then
         PUT((MAQ_LENGTH-1), WIDTH => 1);
      else
         PUT(MAQ_LENGTH, WIDTH => 1);
      end if;
      NEW_LINE;
      VIEW_QUEUES;
      PUT("Do you want to continue with simulation? (y/n): ");
      GET(RESPONSE);
      NEW_LINE;
   end if;
end INTERVAL_CHECK;
-- This procedure checks the status of a particular register. TRUE
-- means the register is available for access, FALSE means the register
-- is blocked and awaiting new data.
procedure CHECK_BLOCKED_REGISTER(REGISTER :in out STRING) is
   REG_NO :INTEGER;
begin
   if BLOCKED_REGISTER(HEX_TO_INTEGER(REGISTER)) then
      -- ENTER_DIQ(TRACE);
      BLOCKED := TRUE;
   end if;

   -- The next statements determine if the instruction immediately following
   -- a LOAD instruction requires the data from the destination register.
   -- The LOAD_SWITCH is set when a load instruction occurs. If the switch
   -- is set (TRUE) then the following instruction's source registers
   -- are checked for dependency against the previous load. If there is
   -- a load dependency, the execution time is incremented by the amount
   -- of the dependency penalty (simulates a stall).
   if LOAD_SWITCH and LOAD_REG = HEX_TO_INTEGER(REGISTER) then
      EXECUTE_TIME := EXECUTE_TIME + LOAD_DEP;
      if VIEW = 'y' then
         PUT("Load dependency stall..");
         PUT(LOAD_DEP,WIDTH => 1);
         PUT_LINE(" cycles");
      end if;
   end if;
end CHECK_BLOCKED_REGISTER;


-- This procedure handles a MAQ full situation.
procedure CHECK_MAQ_FULL is
   CHOICE :INTEGER;
   N      :INTEGER;
begin
   SERVE_MAQ(TRACE);
   if VIEW = 'y' then
      VIEW_QUEUES;
      NEW_LINE;
   end if;
end CHECK_MAQ_FULL;
-- This procedure handles situations when the processor stalls
-- because of data dependency; required data is in the MAQ.
-- In essence the procedure performs "lm" (leave memory) events
-- until the required data is available.
procedure LM2 is
   TRACE2 :MAQ_RECORD;
begin
   -- If the MAQ has separate queues for loads and stores, the
   -- procedure ensures the correct queue is served.
   if SEPERATE_Q = 'y' then
      -- If LOADS have a higher priority (lower value) then serve the
      -- load queue first. If the load queue is empty, then proceed to
      -- the store queue. If both are empty, then serve main memory queue,
      -- which means that the needed item is currently retrieving data
      -- from main memory. Registers released from main memory are
      -- unblocked, as usual.
      if READ_PRI < WRITE_PRI then
         if not Q0.IS_EMPTY(READS) then
            SERVE_Q0(TRACE2);
            MAIN_MEMORY := TRACE2;
         elsif not Q1.IS_EMPTY(WRITES) then
            SERVE_Q1(TRACE2);
            MAIN_MEMORY := TRACE2;
         else
            if MAIN_MEMORY.TRACE_LINE.CODE = '0' then
               BLOCKED_REGISTER(HEX_TO_INTEGER(MAIN_MEMORY.
                  TRACE_LINE.TARGET_REGISTER)) := FALSE;
            end if;
            MAIN_MEMORY.TRACE_LINE.CODE := ' ';
            MAIN_MEMORY.TRACE_LINE.ADDRESS := " (EMPTY) ";
         end if;
      else
         -- Else STORES have priority. Same logic as above applies.
         if not Q1.IS_EMPTY(WRITES) then
            SERVE_Q1(TRACE2);
            MAIN_MEMORY := TRACE2;
         elsif not Q0.IS_EMPTY(READS) then
            SERVE_Q0(TRACE2);
            MAIN_MEMORY := TRACE2;
         else
            if MAIN_MEMORY.TRACE_LINE.CODE = '0' then
               BLOCKED_REGISTER(HEX_TO_INTEGER(MAIN_MEMORY.
                  TRACE_LINE.TARGET_REGISTER)) := FALSE;
            end if;
            MAIN_MEMORY.TRACE_LINE.CODE := ' ';
            MAIN_MEMORY.TRACE_LINE.ADDRESS := " (EMPTY)";
         end if;
      end if;
   else
      -- A single MAQ is used
      if not MAQ.IS_EMPTY(REQUESTS) then
         SERVE_MAQ(TRACE2);
      end if;
   end if;
end LM2;
procedure ISSUE_INSTRUCTION(HIT_RATE :in out FLOAT;
                            TRACE    :in out MAQ_RECORD;
                            VIEW     :in out CHARACTER) is
   CACHE_VALUE   :FLOAT := 0.0;
   HOLD,
   EVENT         :EVENT_RECORD;
   MISS_REGISTER :INTEGER := 0;
   PAUSE         :CHARACTER;
   TRACE2        :MAQ_RECORD;
   CHOICE        :INTEGER;
   R1,R2         :STRING(1..2);
   LM            :STRING(1..2) := "lm";
   II            :STRING(1..2) := "ii";
   N             :INTEGER;
   RANDOM_DIQ    :FLOAT;
begin
   if DIQ_USED = 'y' then
      DIQ_FETCHED := FALSE;
      SERVE_DIQ(TRACE);
      if DIQ_FETCHED = FALSE then
         -- An instruction is issued from the address trace file
         if not END_OF_FILE(INPUT_FILE) then
            DO_LINE_PARSING(INPUT_FILE, TRACE_REC);
            VIEW_TRACE_LINE(TRACE);
            if TRACE.TRACE_LINE.CODE = '2' or TRACE.TRACE_LINE.CODE = '3' then
               RECORD_COUNTER := RECORD_COUNTER + 1;
            end if;
         end if;
      end if;
   else
      -- An instruction is issued from the address trace file
      if not END_OF_FILE(INPUT_FILE) then
         DO_LINE_PARSING(INPUT_FILE, TRACE_REC);
         VIEW_TRACE_LINE(TRACE);
         if TRACE.TRACE_LINE.CODE = '2' or TRACE.TRACE_LINE.CODE = '3' then
            RECORD_COUNTER := RECORD_COUNTER + 1;
         end if;
      end if;
   end if;


   -- checking to see if source registers of the fetched instruction
   -- are blocked or waiting for memory access
   BLOCKED := FALSE;
   if TRACE.TRACE_LINE.SOURCE1_REGISTER /= BLANK then
      CHECK_BLOCKED_REGISTER(TRACE.TRACE_LINE.SOURCE1_REGISTER);
   end if;
   if TRACE.TRACE_LINE.SOURCE2_REGISTER /= BLANK then
      CHECK_BLOCKED_REGISTER(TRACE.TRACE_LINE.SOURCE2_REGISTER);
   end if;
   R1 := TRACE.TRACE_LINE.SOURCE1_REGISTER;
   R2 := TRACE.TRACE_LINE.SOURCE2_REGISTER;

   -- If the instruction is dependent on blocked data the system
   -- stalls until the data is available.
   if BLOCKED then
      -- R1 and R2 are the source registers. If source registers are
      -- used, they must be checked.
      if DIQ_USED = 'y' then
         ENTER_DIQ(TRACE);
      else
         -- if R1 and R2 are used in the instruction
         if (R1 /= BLANK) and (R2 /= BLANK) then
            -- serve the MAQ until the data is available
            while BLOCKED_REGISTER(HEX_TO_INTEGER(R1)) or
                  BLOCKED_REGISTER(HEX_TO_INTEGER(R2)) loop
               -- Serve the next lm event to see if it has the dependent
               -- data. Effects a stall equal to the lm priority value
               -- minus the current execution time. The priority value
               -- of an lm event is the time at which the data will be
               -- available for use.
               SERVE_PEQ(HOLD);
               if VIEW = 'y' then
                  PUT("Process stalled for blocked memory request..");
                  PUT((HOLD.PRIORITY-EXECUTE_TIME),WIDTH => 1);
                  PUT_LINE(" cycles elapsed.");
               end if;
               if HOLD.EVENT_ID = "lm" then
                  LM2;
                  EXECUTE_TIME := HOLD.PRIORITY;
               end if;
            end loop;
         end if; -- R1 not blank and R2 not blank

         -- if R1 is used and R2 is not used
         if (R1 /= BLANK) and (R2 = BLANK) then
            while BLOCKED_REGISTER(HEX_TO_INTEGER(R1)) loop
               SERVE_PEQ(HOLD);
               if VIEW = 'y' then
                  PUT("Process stalled for blocked memory request..");
                  PUT((HOLD.PRIORITY-EXECUTE_TIME),WIDTH => 1);
                  PUT_LINE(" cycles elapsed.");
               end if;
               if HOLD.EVENT_ID = "lm" then
                  LM2;
                  EXECUTE_TIME := HOLD.PRIORITY;
               end if;
            end loop;
         end if; -- R1 not blank and R2 is blank

         -- if R1 is not used and R2 is used
         if (R1 = BLANK) and (R2 /= BLANK) then
            while BLOCKED_REGISTER(HEX_TO_INTEGER(R2)) loop
               SERVE_PEQ(HOLD);
               if VIEW = 'y' then
                  PUT("Process stalled for blocked memory request..");
                  PUT((HOLD.PRIORITY-EXECUTE_TIME),WIDTH => 1);
                  PUT_LINE(" cycles elapsed.");
               end if;
               if HOLD.EVENT_ID = "lm" then
                  LM2;
                  EXECUTE_TIME := HOLD.PRIORITY;
               end if;
            end loop;
         end if; -- R1 blank and R2 not blank
      end if; -- If/then else DIQ used
   end if; -- if blocked


   -- The following statements process the fetched instruction as
   -- a store command (code = 1). An lm event is entered into the
   -- PEQ with a priority value equal to the current execution time
   -- plus the number of cycles required for a write to main memory.
   -- Since the instruction is a main memory request, it is entered
   -- into the MAQ.
   if (TRACE.TRACE_LINE.CODE = '1') then
      TOTAL_PENALTY := TOTAL_PENALTY + LATENCY;
      EVENT.EVENT_ID := LM;
      EVENT.PRIORITY := EXECUTE_TIME + LATENCY;
      ENTER_PEQ(EVENT);
      TRACE.TIME_VALUE := WRITE_PRI;
      -- If using separate queues for loads and stores, put into
      -- store queue (Q1). If Q1 is empty then put into main memory.
      if SEPERATE_Q = 'y' then
         if Q1_SIZE = Q1.LENGTH_OF(WRITES) then
            SERVE_PEQ(HOLD);
            if VIEW = 'y' then
               PUT("Process stalled for blocked memory request...");
               PUT((HOLD.PRIORITY-EXECUTE_TIME),WIDTH => 1);
               PUT_LINE(" cycles elapsed.");
            end if;
            SERVE_Q1(TRACE2);
            MAIN_MEMORY := TRACE2;
            EXECUTE_TIME := HOLD.PRIORITY;
         end if;
         if MAIN_MEMORY.TRACE_LINE.CODE = ' ' then
            MAIN_MEMORY := TRACE;
         else
            ENTER_Q1(TRACE);
         end if;
      else
         if MAQ_SIZE = MAQ.LENGTH_OF(REQUESTS) then
            SERVE_PEQ(HOLD);
            if VIEW = 'y' then
               PUT("Process stalled...MAQ full...serving request...");
               PUT((HOLD.PRIORITY-EXECUTE_TIME),WIDTH => 1);
               PUT_LINE(" cycles elapsed. ");
            end if;
            EXECUTE_TIME := HOLD.PRIORITY;
            SERVE_MAQ(TRACE2);
         end if;
         ENTER_MAQ(TRACE);
      end if;
      -- Since this is not a load instruction, the load dependency
      -- switch is turned off.
      LOAD_SWITCH := FALSE;

      -- for viewing program execution
      if VIEW = 'y' then
         PUT_LINE("WRITE ADDED");
      end if;


   -- Processes the fetched instruction as a load instruction. The
   -- load is either a cache hit or miss. This is determined by the
   -- hit-rate input by the user. A random number generator produces
   -- a value between 0.0 and 100.0. If the number generated is greater
   -- than the hit-rate, then the load is a miss.
   -- Sorry, but that's the best I can do without a cache simulator.
   elsif (TRACE.TRACE_LINE.CODE = '0') then
      -- Since this is a load instruction, the next instruction fetched
      -- must be checked for dependency on this load statement; therefore,
      -- the load destination register is identified and the load switch
      -- is set to alert the processor to check the next instruction.
      LOAD_REG := HEX_TO_INTEGER(TRACE.TRACE_LINE.TARGET_REGISTER);
      LOAD_SWITCH := TRUE;
      CACHE_VALUE := NUMBER_IN_RANGE(0.0, 100.0);
      -- The following statements process the load statement as a cache
      -- miss. An lm event is enqueued to the PEQ with a priority value
      -- equal to the current execution time plus the time it takes
      -- to load from main memory. An entry must also be placed into MAQ.
      if CACHE_VALUE > HIT_RATE then
         TOTAL_PENALTY := TOTAL_PENALTY + LATENCY;
         EVENT.EVENT_ID := LM;
         EVENT.PRIORITY := EXECUTE_TIME + LATENCY;
         ENTER_PEQ(EVENT);
         TRACE.TIME_VALUE := READ_PRI;

         -- Put into appropriate MAQ
         if SEPERATE_Q = 'y' then
            if Q0_SIZE = Q0.LENGTH_OF(READS) then
               SERVE_PEQ(HOLD);
               if VIEW = 'y' then
                  PUT("Process stalled...Write Queue full...");
                  PUT((HOLD.PRIORITY-EXECUTE_TIME),WIDTH => 1);
                  PUT_LINE(" cycles elapsed. ");
               end if;
               SERVE_Q0(TRACE2);
               MAIN_MEMORY := TRACE2;
               EXECUTE_TIME := HOLD.PRIORITY;
            end if;
            if MAIN_MEMORY.TRACE_LINE.CODE = ' ' then
               MAIN_MEMORY := TRACE;
            else
               ENTER_Q0(TRACE);
            end if;
         else
            if MAQ_SIZE = MAQ.LENGTH_OF(REQUESTS) then
               SERVE_PEQ(HOLD);
               if VIEW = 'y' then
                  PUT("Process stalled...MAQ full...");
                  PUT((HOLD.PRIORITY-EXECUTE_TIME),WIDTH => 1);
                  PUT_LINE(" cycles elapsed. ");
               end if;
               SERVE_MAQ(TRACE2);
               EXECUTE_TIME := HOLD.PRIORITY;
            end if;
            ENTER_MAQ(TRACE);
         end if;
         if VIEW = 'y' then
            PUT_LINE("READ MISS ADDED");
         end if;

      -- If the load is a cache hit then the process continues as a
      -- non-memory access. An event ii is enqueued at the end of
      -- the ISSUE_INSTRUCTION procedure.
      else -- READ HIT
         TOTAL_CYCLES := TOTAL_CYCLES + 1;
      end if;


   -- The following statements simulate a processor stall for a
   -- branch instruction (code = 3). The execution time is incremented
   -- by the amount of branch penalty previously specified.
   elsif (TRACE.TRACE_LINE.CODE = '3') then -- BRANCH instruction
      TOTAL_CYCLES := TOTAL_CYCLES + BR_CPI;
      EXECUTE_TIME := EXECUTE_TIME + BR_CPI - 1;
      LOAD_SWITCH := FALSE;

   -- The code = 2 presents a one-cycle-execution statement. Therefore,
   -- the next instruction can be executed at execution time + 1.
   -- This is handled at the end of the II procedure.
   else -- (TRACE.CODE = '2')
      TOTAL_CYCLES := TOTAL_CYCLES + 1;
      LOAD_SWITCH := FALSE;
   end if;


   -- After processing every ii event, another ii event is put into the
   -- PEQ with a priority value of the current execution time + 1, which
   -- means the next instruction can be fetched on the next clock cycle.
   if not END_OF_FILE(INPUT_FILE) then
      EXECUTE_TIME := EXECUTE_TIME + 1;
      EVENT.EVENT_ID := II;
      EVENT.PRIORITY := EXECUTE_TIME;
      ENTER_PEQ(EVENT);
   end if;
end ISSUE_INSTRUCTION;

-- This procedure processes the lm (leave memory) event. It basically
-- takes the appropriate item from the appropriate MAQ. This simulates
-- that the request has completed its main memory access and is available
-- for use. When the item leaves main memory, the next item enters.


procedure LEAVE_MEMORY is
begin
   -- If separate queues are used, then either the load or the store request
   -- has priority to enter memory next. If the loads have priority, then
   -- the load queue (Q0) is served. If Q0 is empty, then the write queue
   -- is served. If both queues are empty, then there are no main memory
   -- requests, and no requests are currently in main memory.

   -- if loads have priority over stores.
   if SEPERATE_Q = 'y' then
      if READ_PRI < WRITE_PRI then
         if not Q0.IS_EMPTY(READS) then
            SERVE_Q0(TRACE);
            MAIN_MEMORY := TRACE;
         elsif not Q1.IS_EMPTY(WRITES) then
            SERVE_Q1(TRACE);
            MAIN_MEMORY := TRACE;
         else
            if MAIN_MEMORY.TRACE_LINE.CODE = '0' then
               BLOCKED_REGISTER(HEX_TO_INTEGER(MAIN_MEMORY.
                  TRACE_LINE.TARGET_REGISTER)) := FALSE;
            end if;
            MAIN_MEMORY.TRACE_LINE.CODE := ' ';
            MAIN_MEMORY.TRACE_LINE.ADDRESS := " (EMPTY)";
         end if;

      -- if stores have priority over loads
      else
         if not Q1.IS_EMPTY(WRITES) then
            SERVE_Q1(TRACE);
            MAIN_MEMORY := TRACE;
         elsif not Q0.IS_EMPTY(READS) then
            SERVE_Q0(TRACE);
            MAIN_MEMORY := TRACE;
         else
            if MAIN_MEMORY.TRACE_LINE.CODE = '0' then
               BLOCKED_REGISTER(HEX_TO_INTEGER(MAIN_MEMORY.
                  TRACE_LINE.TARGET_REGISTER)) := FALSE;
            end if;
            MAIN_MEMORY.TRACE_LINE.CODE := ' ';
            MAIN_MEMORY.TRACE_LINE.ADDRESS := " (EMPTY)";
         end if;
      end if;
   -- If only one queue is used for loads and stores, then the
   -- next item in the queue is served (enters main memory).
   else
      if not MAQ.IS_EMPTY(REQUESTS) then
         SERVE_MAQ(TRACE);
      end if;
   end if;
end LEAVE_MEMORY;
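
The queue-selection rule shared by LM2 and LEAVE_MEMORY can be summarized in a few lines. The following Python sketch is an illustration only, assuming two FIFO queues and integer priorities where a lower value means higher priority; the function name `next_request` and the string request labels are hypothetical. As in the listing, stores are served first whenever the read priority is not strictly lower than the write priority.

```python
from collections import deque


def next_request(reads, writes, read_pri, write_pri):
    """Pick the next request to enter main memory.

    `reads` and `writes` are FIFO deques standing in for Q0 and Q1.
    A lower priority value means higher priority. Returns None when
    both queues are empty, i.e. main memory would be marked (EMPTY).
    """
    if read_pri < write_pri:
        first, second = reads, writes    # loads have priority
    else:
        first, second = writes, reads    # stores have priority
    if first:
        return first.popleft()
    if second:
        return second.popleft()
    return None


if __name__ == "__main__":
    reads = deque(["load A", "load B"])
    writes = deque(["store C"])
    while True:
        request = next_request(reads, writes, read_pri=1, write_pri=2)
        if request is None:
            break
        print(request)
```

The sketch leaves out the register unblocking that the listing performs when both queues are empty and a load is leaving main memory.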
-- This procedure serves items from the PEQ and processes them.
procedure PROCESS_REQUEST is
   PEQ_HOLD :EVENT_RECORD;
   II       :STRING(1..2) := "ii";
   LM       :STRING(1..2) := "lm";
begin
   SERVE_PEQ(PEQ_HOLD);
   if PEQ_HOLD.EVENT_ID = II then
      ISSUE_INSTRUCTION(HIT_RATE, TRACE, VIEW);
   elsif PEQ_HOLD.EVENT_ID = LM then
      LEAVE_MEMORY;
      if END_OF_FILE(INPUT_FILE) then
         EXECUTE_TIME := PEQ_HOLD.PRIORITY;
      end if;
   else
      PUT_LINE("UNKNOWN EVENT...IGNORING");
   end if;
end PROCESS_REQUEST;


--***************************************************************************
--*   MAIN PROGRAM BEGINS HERE !!!
--***************************************************************************
begin -- main procedure
   CLEARSCREEN;
   while ANOTHER = 'y' loop
      GET_INPUT_FILE(INPUT_FILE);
      if SAME_DATA = 'n' then
         GET_INITIAL_DATA;
      end if;
      while not PEQ.IS_EMPTY(WAITING) loop
         PROCESS_REQUEST;
         INTERVAL_CHECK(INTERVAL,MAQ_LENGTH);
         if (RESPONSE = 'n') then
            NEW_LINE;
            PUT_LINE("PROGRAM TERMINATED !!! ");
         end if;
         exit when (RESPONSE = 'n');
         if VIEW = 'y' then
            VIEW_MAQ;
            NEW_LINE;
            VIEW_PEQ;
            NEW_LINE;
            VIEW_DIQ;
            NEW_LINE;
         end if;
      end loop;
      NEW_LINE;
      VIEW_QUEUES;
      CLOSE(INPUT_FILE);
      PEQ.CLEAR(WAITING);
      Q1.CLEAR(WRITES);
      Q0.CLEAR(READS);
      MAQ.CLEAR(REQUESTS);
      EXECUTE_TIME := 0;
      RECORD_COUNTER := 0;
      RESPONSE := 'y';
      NEW_LINE(2);
      PUT("Do you want to do another simulation? (y/n): ");
      GET(ANOTHER);
      NEW_LINE;
      PUT("Keep same parameters? (y/n): ");
      GET(SAME_DATA);
      NEW_LINE;
      SKIP_LINE;
   end loop;
end SPLIT;
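
In place of a cache simulator, the listing decides each load's hit or miss by comparing a random number in the range 0.0 to 100.0 against the user-supplied hit rate, and charges the main-memory latency to the total penalty on every miss. The Python sketch below illustrates that decision rule in isolation. It is a simplified illustration under stated assumptions: the function name `classify_loads`, its parameters, and the use of Python's `random` module (standing in for the listing's NUMBER_IN_RANGE) are hypothetical, and no queueing or stall effects are modeled.

```python
import random


def classify_loads(num_loads, hit_rate, latency, seed=0):
    """Classify loads as hits or misses using a percentage hit rate.

    For each load a value in [0.0, 100.0] is drawn; the load is a miss
    when the value is greater than `hit_rate`. Every miss adds
    `latency` cycles to the accumulated penalty.
    Returns (hits, misses, total_penalty).
    """
    rng = random.Random(seed)    # seeded so a run can be repeated
    misses = 0
    for _ in range(num_loads):
        cache_value = rng.uniform(0.0, 100.0)
        if cache_value > hit_rate:
            misses += 1
    hits = num_loads - misses
    total_penalty = misses * latency
    return hits, misses, total_penalty


if __name__ == "__main__":
    hits, misses, penalty = classify_loads(1000, hit_rate=90.0, latency=10)
    print(hits, misses, penalty)
```

Because the outcome depends only on the random draw, the miss pattern carries no locality; the observed miss fraction simply approaches 100 minus the hit rate as the number of loads grows.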
APPENDIX E. SPA RESULTS OF MATRIX MULTIPLICATION TRACE


Spanner - Sparc performance analyzer
cpu: cy7c601
cache: ss2
register windows: 8 
overflow cost: 170 cycles 
underflow cost: 110 cycles 


OVERALL 


instructions 
annulled delay slots 
load-use stalls 

trap cycles

window handlers 
cache cycles 


overall (%) 
cycles inst. 
cycles count
cycles inst.
raw
raw
category (%)
overall (%)
category (%)
cycles count
sub
overall (%)
raw


ee mmm rr ee mm me ee me ec ee ee eee eee ma ea i ee 


LOGICAL 


and
andcc
andn
or
count
Z03 
16263 
oe 
8060 
27328 
SHIFT


left
right logical
right arithmetic


overall (%) 
category (%)
MULTIPLY


Single step 
read y 
write y 


Gi. 7 8.0 
lee er 
sas oo 
i238 rite 2 


overall (%) 

cycles inst. 
16.6 202.0 
ae LS 


category (%) 
cycles count
8 Ona S67 


cycles 
104755
8059 
8059 


raw 


count 
104755 
8059 
8059 


overall (%) 
cycles inst.


C7 Gad 
651 627 
010.0 100-0 


category (%) 
cycles count


raw 


CONTROL TRANSFER 


conditional branch
unconditional branch
jmpl
call


29 a) 
Or 0 0.0 
ES Se 36) 


overall (%) 
cycles inst.


2859 Bro 
ve eb 
HOO..0 L000. 0 


category (%) 
cycles count 


raw 


COND. BR.: MB86901


backward taken 
backward untaken 
forward taken 
forward untaken 


overall (%) 
cycles inst.


46:.0 556 
14...7 Gree 
Zo 14.8 
Zo ae 
foo. 0- LO0.0 


category (%) 
cycles count


26 =O 
0.6 Oe 
ea od 347 
pa .0 So 


cycles 
io 
306 
10776 
36688 


raw 


cc ccm cr ccm crm crm cc cr cm crm ce cm i ee a ee a SS ee Se ee SS ee a ee ee Se Se ee ss ee ee 


COND. BR.: CY7C601


backward taken 
backward untaken 
forward taken 
forward untaken 


023 O..3 
0.0 0.0 
ee! aed 
0 S05 
8 Sey 


overall (%) 
cycles inst.


category (%)
cycles count


5.6 D8 
On5 02 
34.7 34.7 
ao ad oe 


raw 


cc cc cc co cc cs cc ccc cm cm cc ec cs ce cme ee es ee ee ce i ae SO Se ee ee Se ae Se ae 


JMPL 


call (jmpl)
ret 

retl 

jmp 

other jmpl 


O23 On 
O70 G:.0 
ed Zyl 
eo Sa 
4.9 ao 
overall (%) 
cycles inst. 
0-0 G...0 
O50 G50 
et 1276 
O20 ..0 
0.0 veo 


category (%)
cycles count


Get OL 
O38 Ons 
995 232 
0.0 O70 
0.0 O20 


nel 


raw 


OTHER INSTRUCTIONS 


save 

restore 

ticc untaken 
other 


overall (%) 
cycles inst. 


FOO.0 10020 


category (%) 
cycles count 


raw 


TRAP CYCLES


overflow trap 
underflow trap 
system call trap 
other traps 


0.0 Ore 
0.0 Cee 
Lie 354 
0.0 O70 
Zo ee 


overall (%) 
cycles inst. 


iso 2 
1.4 1.4 
Olek OMe 
O70 0.0 
10070 LOGE 


category (%) 
cycles count 
14.3 Lae3 


is? 1.9 
C8 136 
0.0 0-9 


raw 


WINDOW HANDLERS 


overflow 
underflow 
flush 


O20 0.0 
0.0 0.0 
0.0 Oro 
G50 0.0 
0530 0-0 


overall (%) 
cycles inst. 


category (%) 
cycles count 


65.0 54.5 
So.0 45.5 
0.0 0:0 


raw 


WINDOW SIZES 


trace 

2 windows 
3 windows 
4 windows 
5 windows 
6 windows 
7 windows 
8 windows 
9 windows 
10 windows 
11 windows 
12 windows 
13 windows 
14 windows 
15 windows 
16 windows 


OF 0.0 
OF 1 O50: 
020 0720 
0) 0.0 


overall (%)
cycles inst. 


0.0 0.0 
Pee 0.1 
Se 0-0 
Bid 0.0 
ocd 00 
On6 0) <;0 
0.4 0.0 
0Q.2 Or.0 
On. O20 
O20 0.0 
O70 O79 
0.0 O20 
0.0 O70 
0.0 Oro 
0.0 0.0 
Oa O20 


100-05 10070 


category (%) 
cycles count 
O20 0.0 
10070 ~~ 10C20 
43.5 43.4 
Zao 27.4 


14.4 14.3 
4.8 4.7 
Sa 3.0 
Ze 20 
eo tat 
0.4 0.4 
020 0.0 
0.0 0.0 
0.0 O70 
0.0 0.0 
0.0 O-0 
0.0 0.0 


raw 


SS SD MN MUS ey my SS SS A mmm SS ce a ca yc me em em ce ee ee es eee eee 


CACHE CYCLES: SS1


I-read miss 
D-read miss 
D-write miss 
write buffer stalls 


overall (%) 
cycles inst.


category (%) 
cycles count 
49.0 24.7 

ZS 14.8 

5). 7 50.4 
i 1 


raw 


Se ce ce cm cm ce ecm ce es es ee es es ees es es es ee se i es es ce ee ee ee oe eee ee 


Zare Cee 
ales Oat 
Ore 0.4 
Or O-.1 
4.4 Oo 


a IeZ 


CACHE CYCLES: SS2 overall (%) category (%) raw


cycles inst. cycles count cycles count
I-read miss Zo O 2a 4377 poe 15492 633
D-read miss De Oot 27.4 Ze 9724 400
D-write miss 6 0.4 ZB 5 6542 10101 2080
write buffer stalls Ono 0.0 O53 2.4 lee ao
total D6 U6 100.0 100.0 35430 S38
APPENDIX F. SPA RESULTS OF PSEUDO CODE TRACE


Spanner - Sparc performance analyzer
cpu: cy7c601
cache: ss2
register windows: 8
overflow cost: 170 cycles
underflow cost: 110 cycles


OVERALL                   raw cycles
instructions                  466217
annulled delay slots            4053
load-use stalls                27914
trap cycles                      564
window handlers                 6050
cache cycles                  104153
total                         608951


INSTRUCTIONS              raw cycles   raw count
memory access                 147185       62332
alu                           216411      216411
floating point                  2397        2397
control transfer               85062       80022
other instructions             15162       15162


MEMORY ACCESS


load
store
atomic

overall (%) category (%) 

cycles inst. cycles count 
Tee6 —lOWO 7.60 - 
Or ed Ona - 
4.6 7 <4 Ar26 = 
Oreck 0.0 Ost - 
eso Ceo dao = 
eee ne et - 

100.0 = LO020 = 

overall (%) category (%) 

cycles inst. cycles count 
Zac 6. 6 SLs 16. 6 
35.5 ees 46.4 So 
0.4 0.6 03.5 0.6 
14.0 Ze 3 ere AS 
Z55 4.0 S5 4.0 
76.6 10020 LODZ OF TOO .G 


overall (%) category (%) 


raw 


cycles inst. cycles count 
LS0 az 6221 7 ee 
OeyZ 4.8 3759 ono 
Oo 020 0.0 C0 
24.2 1676 100.0 10070 


raw 


ce cc cc ccm ccc ce cr ee ee ee ee ce es a eee ee ee ee ee ee ee a ee 


overall (%) category (%) 
cycles inst. cycles count 
56 4.5 So 8 387.0 
Oe 1 CO 0.6 0.6 
Tee 6.9 SZiae 5219 
Oat O30 0.4 0.3 
ie O.1 Ot Om 
2 Ong 8.8 620 
52.0 LS LO020 TOGO 


dei4 


overall (%) 
cycles “inst. 


category (%) 
cycles count 


20 ak 2Oreo 
ek Ort 
55.6 Sur > 
Zoe, 1. 1.6 
lea ee 
sO yest dex 


raw 
COunmEe 
a2 5 
25 
OB o9 


arithmetic 
logical 
shift
multiply 
sethi 


276 1.4 
0.0 0.0 
5.4 9 
OZ Om 
OF Oat 
O20 0.4 
BZ 4.8 
overall (%) 
cycles inst. 
13e Zao 
dvi Za 6 
<3 2.0 
Pao 3.50 
dee? 23:8 


category (%) 
cycles count 


BS.00 38.5 
47.9 47.9 
S25 So 
Sa DeaZ 
4.8 4.8 


cycles 
83343 
103743
7644 
11278 
10403 


raw 
count 

83343 

103743 

7644 

11278

10403 


ARITHMETIC 


add 

addcc 

addx 
addxcc 

sub 

subcc

subx
subxcc
taddcc
taddcctv
tsubcc
tsubcctv
cmp (subcc)
tst (subcc) 


overall (%) 
Cycles inst. 


category (%) 
cycles count 
24.1 24.1 


4) 
Oo 
an 
Oo 


br 
OP ODODODODONNOC OS 
DHNODVODODOOO0OWON 


re 


OP OOOOOONNCOC SO 
MHMOODTOOO O00 0WOhN 


s 


it 
1 


216411 


cycles 
20126 
4915 
89 

0 

1947 
10003 


raw 


LOGICAL 


and 
andcc
andn
andncc
or

orcc
orn
orncc
xor
xorcc
xnor
xnorcc
mov (or) 
tst (orcc) 


category (%) 
cycles count 
2.0 210 


od 


mBPOOoOdOrFROVWOFR FOO UI 
pS 


HDHROOOMODOO0OFR ~IWND 


mBooodrc0o0orcec”orrLood ui 
HHODVOMODOO0FR ~TIWN 


cycles 
Zee 
5428 


TS 265 
4757 


raw 
count 
ZZ 
5428 


T2035 
4757 


325 Pe 
ORS 3 
0.0 Or 1 
O50 0): 0 
0.3 Or. SD 
6 a 
0-0 0.0 
0.0 0% © 
O20 Cr.0 
0.0 O10 
v0 0c. 0 
0:20 07.0 
62 ono 
lo 2.4 
Lests / Zee 1 
overall (%)

cycles inst. 
U3 0.6 
O29 1.4 
0.0 Onc. 
0.1 OnZ 
2.4 7 
OZ O23 
0.0 0.0 
O'.-0 0.0 
OZ 0.4 
OPO O20 
0.0 0.0 
O:0 07.0 
72-0 1335 
Ors eS 
17 27.6 


1a, 


103743 


103743 


SHIFT


left
right logical
right arithmetic


overall (%) 
cycles inst.


category (%) 
cycles count 


raw 


MULTIPLY


single step 
read y 
write y 


Or28 era | 
0.3 OS 
OZ 0.2 
siaee Zao 


overall (%) 
cycles inst. 


O2.3 62.3 
Zoe 25.4 
Ze 233 
100.0: © EO Oz 


category (%) 
cycles count


raw 


mm mm a cme mm mr cy mmm ca cm cr wc cr cm cc crm es ee ee ee we we ee 


6 ZG 
Cat Oe2 
Oa OneZ 
19 570 


overall (%) 
cycles inst. 


86.8 86.8 
6.6 6.6 
67.6 6.26 

10'0%..0 a0 Oro 


category (%) 
cycles count 
567.6 86.6 
13.4 13.4 


raw 


ee ee ee me me ee ee me ree Se ren cm me ee ce ee ee ce ee ee ei ee ey ee ee ee ce ey ee ee ee ee 


CONTROL TRANSFER 


conditional branch 
unconditional branch 
jmpl 

call 


3 2.4 
OZ 0.4 
Le? Zia 


overall (%) 
cycles inst. 


LOZ L635 
E38 PLN 
ed eS 

Oss a3 


LOO). 0" 1000 


category (%) 
cycles count


ere.) leo 
9.4 LOEO 
eo Gre 
529 6i..2 


raw 


ys wr rr cm wm crm cm ee ee cc ce ee ee ee ee 


COND. BR.: MB86901


backward taken 
backward untaken 
forward taken 
forward untaken 


overall (%) 
cycles inst.


14 23 
One Oe 2 
3.4 oo 
OS oe 


category (%) 
cycles count 


raw 


ee ey ee ey ee ee ee ee ee ee ee ey ee ee ee ee ee ee ee ee ee ee ee ee ce ee ee ee ee ee ee ee ee ee ee em ee ee eee 


COND. BR.: CY7C601


backward taken 
backward untaken 
forward taken 
forward untaken 


overall (%) 
cycles inst.


1.4 ae,5 
Oz. OF 2 
3.4 520 
D3 SiS 


S29 jb a a 
AS: eZ 
Zo 334.0 
67:16 5 lie 
100-0 200720 


category (%) 
cycles count 


AD af oe 

2s ee 
SOD oo 
oye, Sl. 


raw 


mm ee mm Se a a ee es cme me Se wc em co ee ee ee ee ee ee es ee ee a ee ee eo eee eee ee ee 


JMPL 


call (jmpl)


overall (%) 
cycles inst. 
a 1) 0.0 


category (%) 
cycles count 
OZ Ou2 


PAG 


raw 


ret 

retl 

jmp 

other jmpl 


O:.3 O..3 
68 ./ 88.7 
0.4 0.4 
O03 Ors 


=> Gee Gee Gee Gee Gee ee > ee ee wr ce ce ce ts ce ce ce i ce ce i ee ce ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee em om om om em a ea 


OTHER INSTRUCTIONS 


save 

restore 

ticc untaken 
other 


Ore Oe 
iDeree, ee 2 
O20 0.0 
O01. 0 0<.0 
ee ee 


overall (%) 
cycles inst.


category (%) 
cycles count 


ES ee i are 7 
eG isi Lee 7 
62.7 625.7 
0.0 0.0 


raw 


TRAP CYCLES


overflow trap 
underflow trap 
system call trap 
other traps 


O29 7 
025 0.8 
Po 200 
0.0 O20 
2 4.0 


overall (%) 
cycles inst.


category (%) 
cycles count 


526 Ie 
14.9 14.9 
62.2 6915 
0.0 00 


raw 


WINDOW HANDLERS 


overflow 
underflow 
flush 


020 0.0 
G20 0.0 
On O20 
Or0 0.0 
O28 0.0 


overall (%) 
cycles inst. 


category (%) 
cycles count 


raw 


------------------------------------------------------------------------


WINDOW SIZES 


trace 
2 windows 
3 windows 
4 windows 
5 windows 
6 windows 
7 windows 
8 windows 
9 windows 
10 windows 
11 windows 
12 windows 
13 windows 
14 windows 
15 windows 
16 windows 


O26 20 
0.4 0.0 
O50 O20 
i.0 0.0 


overall (%) 
cycles inst.


0.0 0.0 
13052 eo 
AD. 2 06 
21.6 O23 
ES. 6 On 
eo O.1 
1.4 Oro 
0 0.0 
O27. 0.0 
Us, 0.0 
0.4 0.0 
0.4 0.0 
Oe 020 
O33 O20 
O22 v0 
C22 O20 


61553 oi2 
Se 48.8 

OFC 20 
100.0 100.0


category (%) 
cycles count 


O20 0.0 
le0r0 ~ 00.0 
ae6 32.3 
2152 21 e2 
2ee 1 Ze! 
One Grt 
oe liga 
0-8 One 
Oo Ox 
0.4 0.4 
Ors 073 
Os: Os 
OR 2 Oe2 
OFZ 0.2 
Oe 2 On 
OE Or 


cycles 
0 
792920 
299350 
168230 
96270 
48390 
8460 
6050 
4200 
3080 
2520 
2240 
TI6G0 
1680 
1400 
1520 


raw 


CACHE CYCLES: SS1


I-read miss 
D-read miss 


overall (%)
cycles inst.
6.2 0.8 
2.16 Oines 


category (%) 
cycles count 
454.2 eo 
ons 82.3 


ele 


cycles 
S560 
az 


raw 
count 

2130 

ie a 


D-write miss Zi 1.6 14.8 38.9 2250 6125 
write buffer stalls Zo ee Ane S29 17518 Swe 
total 1326 4.2 1003.0. EOGzC 83050 15738 
CACHE CYCLES: SS2 overall (%) category (%) raw 

cycles inst. cycles count cycles count
I-read miss 8.4 OES 49.1 2 Sito 2085 
D-read miss 4.3 es 24.9 1 2a oG3 1068 
D-write miss 4.3 eS 24.9 59.7 25931 5497 
write buffer stalls Dez Orel lO 6.1 NOS 2 262 
total Le) 2.4 1002.0, -'CG 30 104133 9212 
APPENDIX G. SPA RESULTS OF TRAJECTORY PROGRAM TRACE


Spanner -- Sparc performance analyzer


cpu: CY7C601
cache: SS2
register windows: 8 
overflow cost: 170 cycles 
underflow cost: iMmCrmevycles 


**************************************************************************
*                                                                        *
* WARNING: More than 1% of instructions are floating point instructions. *
*          Spanner does not simulate the floating point pipeline.        *
*                                                                        *
**************************************************************************


OVERALL 


instructions 


annulled delay slots 


load-use stalls 
trap cycles 
window handlers 
cache cycles 


overall (%) 
cycles inst.
Tae SLO0. 


category (%) 
cycles count 
7a AN - 


cycles 
598746 
34 31: 
26141 
Gg 2 
11540 
134443 


raw 


INSTRUCTIONS 


memory access 

alu 

floating point 
control transfer 
other instructions 


overall (%)
cycles inst.


Pasha) 14.0 
S729 Sue 0 
O29 104 
14.8 21.8 
ilies) 2 


category (%) 
cycles count


2873 14.0 
hors 1 60.0 
deb iv 4 
Os 2 2.8 
22 Zeal 


Tie 2 53 


cycles 
169647 
295516 

6673 
PES 3 
13427 


raw 
count 

68638 

293876 

6673 

106833 

3427] 


------------------------------------------------------------------------


MEMORY ACCESS 


load 
store 
atomic 


overall (%) 
cycles inst. 


223 oF. 
0 a .9 
O20 O70 


category (%) 
cycles count 


5998746 


cycles 
25165 
74475 


raw 


------------------------------------------------------------------------


overall (%) 
cycles inst.


4.6 5.0 
Od OnaL 
De 4.2 
G0 O20 


=o. 1 Gon< 
a3. 9 34.8 

O70 QO. 0 
100.0 100.0


category (%) 
cycles count 
Sis) 3979 


Pes, G26 
42.8 45.5 
OL 0.0 


oo 


169647 


cycles 
35728 
az 
40706 
Sb 


raw 
count

17864 

256 

20353 

17 


------------------------------------------------------------------------


overall (%) 
cycles inst. 


100.0 100.0


category (%) 
cycles count 


5.6 30), 1 
Cot Ore 
46.4 48.2 
0.33 O43 
1.4 1.4 
Lee io 


raw 


ALU 


arithmetic 
logical
shift 
multiply 
sethi 


Sa) dee 
0.0 0.0 
4.4 Zoe 
0.0 0.0 
Oe Omer 
2 06 
oe 4.9 


overall (%) 
cycles inst. 


14.5 25.0 
20145 32 
O) 2s: 0.8 
Oia ie 
de? Lh 


category (%) 
cycles count 


S35 SOn 
Sa 54.1 
1.4 1.4 
1.8 re 
4.4 4.4 


raw 
count 

112649 
158945 

4070 

Sl 3 

PSUs2 


ARITHMETIC 


add 

addcc 

addx 
addxcc 

sub 

subcc

subx 
subxcc 
taddcc 
taddcctv 
tsubcc 
tsubcctv
cmp (subcc) 
tst (subcc) 


overall (%) 
cycles inst. 
ae 4. 


eocoocoo con © 2 @ 
Hr~wA7ovoc0ncoodor+4+ L000 -] 
Nooo ooo OWO Coo Ff 
WIMmMooTeTWeOOOAORnOOWN 


category (%) 
cycles count 
18.4 18.4 


Spree: 26.0 
0.0 0.0 
0.0 0.0 
Zio 250 
16.4 16.4 
0.0 0.0 
020 0.0 
0.0 0.0 
O-o Or 
0.0 0.0 
0.0 0.0 
a. 9 AS a0 
Ft. 0 a0 


293870 


cycles 
Z0GG2Z 
6532 

0 

0 

2032 
16302 


raw 


LOGICAL 


and 
andcc 
andn 
andncc
or
orcc
orn
orncc
xor
xorcc


overall (%) 
cycles inst. 


On7 ake 
O.-5 O28 
C20 00 
Orer Deel 
aes See) 
Ome 0.3 
C0 O20 
O20 O20 
0.6 0 
Oe 0%0 


category (%) 
cycles count 


4/0 SS 
2.4 2.4 
O21 Oct 
0.2 Nees 
11.4 Lye 
L.0 10 
O20 O20 
0.0 0.0 
0 3.0 
o20 OO 


i 0 


112649 


cycles 
oe 
SH 


520 


18165 
1663 
0 


4785 


112649 


raw 
count
5323 
3791 
88 
390 
18165 
1663 


4785 


xnor
xnorcc
mov (or)
tst (orcc)


0 

0 

20 Or: 
4039 


0 

0 
L200: 
4039 


------------------------------------------------------------------------


SHIFT 


left
right logical
right arithmetic


O20 0.0 
Ve 0 Da 
1 Sis oe 
Oro 028 
20RS B25 


overall (%) 
cycles inst.


category (%) 
cycles count 
mie cae, 36.9 
50.0 50)..0 
Dek 3 


raw 


MULTIPLY 


single step 
read y 
write y 


OR 2 O23 
OS O24 
Oa! Orel: 
Oe) vee 


overall (%) 
cycles inst. 


100.0 100.0


category (%) 
cycles count 
316 <3 woo 


raw 


0.6 0. 
0:20 Oe 
O70 ie: 
Cz irae 


overall (%) 
cycles inst. 


6.9 o2 
549 Ono 
100.0 100.0


category (%) 
cycles count 
88.8 858.8 
eleeee dee 


raw 


------------------------------------------------------------------------


CONTROL TRANSFER 


conditional branch 
unconditional branch 
jmpl 

call 


ee 2.4 
a0ri2 O23 
er Ze) 


overall (%) 
cycles inst. 


1Ox2 Le .2 
1.4 23 
age i 
a 0 NS 


100.0 100.0


category (%) 
cycles count 
Soa Te 


cycles 
79264 
dizi 
los30 
8068 


raw 


count 
79264 
eZ ed 
8290 
8068 


COND. BR.: MB86901 


backward taken 
backward untaken 
forward taken 
forward untaken 


overall (%) 
cycles inst. 
oe 
O23 
26 
fee i: 


oor 


Ons OS 
14.4 128 
10) 16 
100.0 100.0


category (%) 
cycles count 
Sy! Oe 
Ze 1.4 
40.2 54.7 
aL. 0 34,7 


SZ 


cycles 
7244 
2286 
43390 
54974 


raw 


------------------------------------------------------------------------


COND. BR.: CY7C601


backward taken 
backward untaken 
forward taken 
forward untaken 


overall (%) 
cycles inst.


100.0 100.0


category (%) 

cycles count
cya Ord 
al al 


34. 34. 


4 4 
54.7 5407: 
7 i 


raw 


------------------------------------------------------------------------


9 2 
O.1 2 
6 S27 
Bio a. 6 
BOs, 2 16 


ie 


JMPL 


call (jmpl) 
ret 
retl 
jmp 
other jmpl 


overall (%) 
cycles inst. 


category (%) 
cycles count 


Oat Oot 
Seo Deer 
O27 92.3 
ee eS 
ae eae 


cycles


raw 


OTHER INSTRUCTIONS 


save 

restore 

ticc untaken 